DonorsChoose

DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.

Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:

  • How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
  • How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
  • How to focus volunteer time on the applications that need the most assistance

The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.

About the DonorsChoose Data Set

The train.csv data set provided by DonorsChoose contains the following features:

Feature Description
project_id A unique identifier for the proposed project. Example: p036502
project_title Title of the project. Examples:
  • Art Will Make You Happy!
  • First Grade Fun
project_grade_category Grade level of students for which the project is targeted. One of the following enumerated values:
  • Grades PreK-2
  • Grades 3-5
  • Grades 6-8
  • Grades 9-12
project_subject_categories One or more (comma-separated) subject categories for the project from the following enumerated list of values:
  • Applied Learning
  • Care & Hunger
  • Health & Sports
  • History & Civics
  • Literacy & Language
  • Math & Science
  • Music & The Arts
  • Special Needs
  • Warmth

Examples:
  • Music & The Arts
  • Literacy & Language, Math & Science
school_state State where school is located (Two-letter U.S. postal code). Example: WY
project_subject_subcategories One or more (comma-separated) subject subcategories for the project. Examples:
  • Literacy
  • Literature & Writing, Social Sciences
project_resource_summary An explanation of the resources needed for the project. Example:
  • My students need hands on literacy materials to manage sensory needs!
project_essay_1 First application essay*
project_essay_2 Second application essay*
project_essay_3 Third application essay*
project_essay_4 Fourth application essay*
project_submitted_datetime Datetime when project application was submitted. Example: 2016-04-28 12:43:56.245
teacher_id A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56
teacher_prefix Teacher's title. One of the following enumerated values:
  • nan
  • Dr.
  • Mr.
  • Mrs.
  • Ms.
  • Teacher.
teacher_number_of_previously_posted_projects Number of project applications previously submitted by the same teacher. Example: 2

* See the section Notes on the Essay Data for more details about these features.

Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

Feature Description
id A project_id value from the train.csv file. Example: p036502
description Desciption of the resource. Example: Tenor Saxophone Reeds, Box of 25
quantity Quantity of the resource required. Example: 3
price Price of the resource required. Example: 9.95

Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you use it as a key to retrieve all resources needed for a project:

The data set contains the following label (the value you will attempt to predict):

Label Description
project_is_approved A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved, and a value of 1 indicates the project was approved.

Notes on the Essay Data

    Prior to May 17, 2016, the prompts for the essays were as follows:
  • __project_essay_1:__ "Introduce us to your classroom"
  • __project_essay_2:__ "Tell us more about your students"
  • __project_essay_3:__ "Describe how your students will use the materials you're requesting"
  • __project_essay_3:__ "Close by sharing why your project will make a difference"
    Starting on May 17, 2016, the number of essays was reduced from 4 to 2, and the prompts for the first 2 essays were changed to the following:
  • __project_essay_1:__ "Describe your students: What makes your students special? Specific details about their background, your neighborhood, and your school are all helpful."
  • __project_essay_2:__ "About your project: How will these materials make a difference in your students' learning and improve their school lives?"

  • For all projects with project_submitted_datetime of 2016-05-17 and later, the values of project_essay_3 and project_essay_4 will be NaN.
In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

from plotly import plotly
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
from collections import Counter

1.1 Reading Data

In [2]:
project_data = pd.read_csv('train_data.csv')
resource_data = pd.read_csv('resources.csv')
In [3]:
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)
Number of data points in train data (109248, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state'
 'project_submitted_datetime' 'project_grade_category'
 'project_subject_categories' 'project_subject_subcategories'
 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3'
 'project_essay_4' 'project_resource_summary'
 'teacher_number_of_previously_posted_projects' 'project_is_approved']
In [4]:
labels=project_data['project_is_approved']
project_data.drop(['project_is_approved'],axis=1,inplace=True)
In [5]:
labels=labels.head(50000)
In [6]:
project_data=project_data[0:50000]
In [7]:
project_data.head(1)
Out[7]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects
0 160221 p253737 c90749f5d961ff158d4b4d1e7dc665fc Mrs. IN 2016-12-05 13:43:57 Grades PreK-2 Literacy & Language ESL, Literacy Educational Support for English Learners at Home My students are English learners that are work... \"The limits of your language are the limits o... NaN NaN My students need opportunities to practice beg... 0

Stratified Sampling: Splitting data into Train and Test

In [8]:
from sklearn.model_selection import train_test_split
project_data_train, project_data_test, labels_train, labels_test = train_test_split(project_data, labels , test_size=0.33, stratify=labels)
print(project_data_train.shape)
print(project_data_test.shape)
print(labels_train.shape)
print(labels_test.shape)
(33500, 16)
(16500, 16)
(33500,)
(16500,)
In [9]:
project_data_train.head(2)
Out[9]:
Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects
1361 43445 p097740 0d153ad81f03058ce80c0c3c697b77b5 Teacher CA 2017-03-31 00:34:13 Grades 6-8 Math & Science Applied Sciences, Mathematics A Mindstorm is Brewing When I tell people what I teach, the response ... LEGO Mindstorms will give my students an exper... NaN NaN My students need access to meaningful applicat... 1
45965 175861 p153488 363788b51d40d978fe276bcb1f8a2b35 Mrs. CA 2017-03-30 22:46:14 Grades 3-5 Literacy & Language, Math & Science Literature & Writing, Mathematics Color Our Academic World \"All kids need is a little help, a little hop... Collaboration is our middle name! My kids work... NaN NaN My students need markers to be able to work in... 47
In [10]:
labels=list(labels_train)
In [11]:
ids=list(project_data_train['id'])
In [12]:
data={'labels':labels, 'id':ids}

df=pd.DataFrame(data)

print(df.head(2))
   labels       id
0       1  p097740
1       1  p153488
In [13]:
project_data_train = pd.merge(project_data_train, df, on='id', how='left').reset_index()

project_data_train.head(2)
Out[13]:
index Unnamed: 0 id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects labels
0 0 43445 p097740 0d153ad81f03058ce80c0c3c697b77b5 Teacher CA 2017-03-31 00:34:13 Grades 6-8 Math & Science Applied Sciences, Mathematics A Mindstorm is Brewing When I tell people what I teach, the response ... LEGO Mindstorms will give my students an exper... NaN NaN My students need access to meaningful applicat... 1 1
1 1 175861 p153488 363788b51d40d978fe276bcb1f8a2b35 Mrs. CA 2017-03-30 22:46:14 Grades 3-5 Literacy & Language, Math & Science Literature & Writing, Mathematics Color Our Academic World \"All kids need is a little help, a little hop... Collaboration is our middle name! My kids work... NaN NaN My students need markers to be able to work in... 47 1
In [14]:
project_data_train.drop(['Unnamed: 0','index'],axis=1,inplace=True)
In [15]:
project_data_train.head(2)
Out[15]:
id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_categories project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects labels
0 p097740 0d153ad81f03058ce80c0c3c697b77b5 Teacher CA 2017-03-31 00:34:13 Grades 6-8 Math & Science Applied Sciences, Mathematics A Mindstorm is Brewing When I tell people what I teach, the response ... LEGO Mindstorms will give my students an exper... NaN NaN My students need access to meaningful applicat... 1 1
1 p153488 363788b51d40d978fe276bcb1f8a2b35 Mrs. CA 2017-03-30 22:46:14 Grades 3-5 Literacy & Language, Math & Science Literature & Writing, Mathematics Color Our Academic World \"All kids need is a little help, a little hop... Collaboration is our middle name! My kids work... NaN NaN My students need markers to be able to work in... 47 1

preprocessing of project_subject_categories - Train Data

In [16]:
catogories = (project_data_train['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp+=j.strip()+" " #" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_') # we are replacing the & value into 
    cat_list.append(temp.strip())
    
project_data_train['clean_categories'] = cat_list
project_data_train.drop(['project_subject_categories'], axis=1, inplace=True)
In [17]:
unique_list = []
for x in cat_list:
    if x not in unique_list: 
            unique_list.append(x)
    
    
#print(unique_list)

categories=pd.DataFrame({'clean_categories': unique_list})
categories=categories.sort_values(['clean_categories'], ascending=True).reset_index()

print(categories.head(2))
   index               clean_categories
0      8                AppliedLearning
1     28  AppliedLearning Health_Sports
In [18]:
df1=project_data_train[['clean_categories','labels']][(project_data_train['labels']==1)]

print(df1.head(2))
                 clean_categories  labels
0                    Math_Science       1
1  Literacy_Language Math_Science       1
In [19]:
df2=project_data_train[['clean_categories','labels']][(project_data_train['labels']==0)]
In [20]:
z =df1.groupby(['clean_categories'])['labels'].value_counts() /project_data_train.groupby(['clean_categories'])['labels'].count()

group_1=pd.DataFrame(z)

group_1=group_1.reset_index(drop=True)
print(group_1.head(2))
     labels
0  0.816926
1  0.830601
In [21]:
z1 =df2.groupby(['clean_categories'])['labels'].value_counts() /project_data_train.groupby(['clean_categories'])['labels'].count()

group_0=pd.DataFrame(z1)

group_0=group_0.reset_index(drop=True)
print(group_0.head(2))
     labels
0  0.183074
1  0.169399
In [22]:
x1= df1.groupby(['clean_categories'])['labels'].value_counts() 
class_1=pd.DataFrame(x1)
class_1=class_1.reset_index(drop=True)
print ( class_1.head(2))
   labels
0     946
1     152
In [23]:
x0= df2.groupby(['clean_categories'])['labels'].value_counts() 
class_0=pd.DataFrame(x0)
class_0=class_0.reset_index(drop=True)
print ( class_0.head(2))
   labels
0     212
1      31
In [24]:
Response_Table = pd.concat([categories, class_0, class_1],axis=1)
In [25]:
#taken from https://stackoverflow.com/questions/24685012/pandas-dataframe-renaming-multiple-identically-named-columns

def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(item, counter)
        new_columns.append(newitem)
    df.columns = new_columns
    return df
In [26]:
Response_Table = df_column_uniquify(Response_Table)
In [27]:
Response_Table.rename(columns={'labels':'Class=0','labels_1':'Class=1'},inplace=True)

print("Response Table for Categories")

Response_Table
Response Table for Categories
Out[27]:
index clean_categories Class=0 Class=1
0 8 AppliedLearning 212.0 946
1 28 AppliedLearning Health_Sports 31.0 152
2 30 AppliedLearning History_Civics 12.0 48
3 16 AppliedLearning Literacy_Language 106.0 586
4 19 AppliedLearning Math_Science 63.0 249
5 9 AppliedLearning Music_Arts 45.0 194
6 27 AppliedLearning SpecialNeeds 93.0 366
7 40 AppliedLearning Warmth Care_Hunger 473.0 3
8 5 Health_Sports 12.0 2660
9 21 Health_Sports AppliedLearning 1.0 49
10 38 Health_Sports History_Civics 45.0 15
11 25 Health_Sports Literacy_Language 17.0 221
12 32 Health_Sports Math_Science 14.0 56
13 29 Health_Sports Music_Arts 61.0 35
14 6 Health_Sports SpecialNeeds 104.0 378
15 42 Health_Sports Warmth Care_Hunger 3.0 8
16 11 History_Civics 41.0 473
17 34 History_Civics AppliedLearning 17.0 14
18 22 History_Civics Health_Sports 15.0 7
19 23 History_Civics Literacy_Language 14.0 414
20 26 History_Civics Math_Science 990.0 102
21 12 History_Civics Music_Arts 33.0 79
22 36 History_Civics SpecialNeeds 4.0 62
23 4 Literacy_Language 28.0 6320
24 17 Literacy_Language AppliedLearning 594.0 162
25 43 Literacy_Language Health_Sports 98.0 13
26 31 Literacy_Language History_Civics 173.0 205
27 1 Literacy_Language Math_Science 941.0 3878
28 24 Literacy_Language Music_Arts 58.0 425
29 13 Literacy_Language SpecialNeeds 23.0 1043
30 46 Literacy_Language Warmth Care_Hunger 33.0 2
31 0 Math_Science 99.0 4180
32 18 Math_Science AppliedLearning 89.0 326
33 35 Math_Science Health_Sports 107.0 95
34 33 Math_Science History_Civics 1.0 147
35 20 Math_Science Literacy_Language 230.0 596
36 15 Math_Science Music_Arts 1.0 429
37 2 Math_Science SpecialNeeds 2.0 474
38 41 Math_Science Warmth Care_Hunger 3.0 2
39 14 Music_Arts 4.0 1339
40 44 Music_Arts AppliedLearning 231.0 1
41 39 Music_Arts Health_Sports 3.0 5
42 47 Music_Arts History_Civics 19.0 4
43 37 Music_Arts SpecialNeeds 25.0 42
44 3 SpecialNeeds NaN 1037
45 45 SpecialNeeds Health_Sports NaN 6
46 10 SpecialNeeds Music_Arts NaN 83
47 48 SpecialNeeds Warmth Care_Hunger NaN 5
48 7 Warmth Care_Hunger NaN 396
In [28]:
category_1 = pd.concat([categories,group_0,group_1],axis=1).reset_index()
category_1
Out[28]:
level_0 index clean_categories labels labels
0 0 8 AppliedLearning 0.183074 0.816926
1 1 28 AppliedLearning Health_Sports 0.169399 0.830601
2 2 30 AppliedLearning History_Civics 0.200000 0.800000
3 3 16 AppliedLearning Literacy_Language 0.153179 0.846821
4 4 19 AppliedLearning Math_Science 0.201923 0.798077
5 5 9 AppliedLearning Music_Arts 0.188285 0.811715
6 6 27 AppliedLearning SpecialNeeds 0.202614 0.797386
7 7 40 AppliedLearning Warmth Care_Hunger 0.150974 1.000000
8 8 5 Health_Sports 0.196721 0.849026
9 9 21 Health_Sports AppliedLearning 0.062500 0.803279
10 10 38 Health_Sports History_Civics 0.169173 0.937500
11 11 25 Health_Sports Literacy_Language 0.232877 0.830827
12 12 32 Health_Sports Math_Science 0.285714 0.767123
13 13 29 Health_Sports Music_Arts 0.138952 0.714286
14 14 6 Health_Sports SpecialNeeds 0.180243 0.861048
15 15 42 Health_Sports Warmth Care_Hunger 0.176471 1.000000
16 16 11 History_Civics 0.090110 0.819757
17 17 34 History_Civics AppliedLearning 0.142857 0.823529
18 18 22 History_Civics Health_Sports 0.159574 1.000000
19 19 23 History_Civics Literacy_Language 0.184211 0.909890
20 20 26 History_Civics Math_Science 0.135431 0.857143
21 21 12 History_Civics Music_Arts 0.169231 0.840426
22 22 36 History_Civics SpecialNeeds 0.235294 0.815789
23 23 4 Literacy_Language 0.120172 0.864569
24 24 17 Literacy_Language AppliedLearning 0.132826 0.830769
25 25 43 Literacy_Language Health_Sports 0.187380 0.764706
26 26 31 Literacy_Language History_Civics 0.142270 0.879828
27 27 1 Literacy_Language Math_Science 0.183753 0.867174
28 28 24 Literacy_Language Music_Arts 0.151042 0.812620
29 29 13 Literacy_Language SpecialNeeds 0.194915 0.857730
30 30 46 Literacy_Language Warmth Care_Hunger 0.183333 1.000000
31 31 0 Math_Science 0.142446 0.816247
32 32 18 Math_Science AppliedLearning 0.171815 0.848958
33 33 35 Math_Science Health_Sports 0.184165 0.805085
34 34 33 Math_Science History_Civics 0.333333 0.816667
35 35 20 Math_Science Literacy_Language 0.146590 0.857554
36 36 15 Math_Science Music_Arts 0.500000 0.828185
37 37 2 Math_Science SpecialNeeds 0.285714 0.815835
38 38 41 Math_Science Warmth Care_Hunger 0.428571 0.666667
39 39 14 Music_Arts 0.086957 0.853410
40 40 44 Music_Arts AppliedLearning 0.182177 0.500000
41 41 39 Music_Arts Health_Sports 0.333333 0.714286
42 42 47 Music_Arts History_Civics 0.186275 0.571429
43 43 37 Music_Arts SpecialNeeds 0.059382 0.913043
44 44 3 SpecialNeeds NaN 0.817823
45 45 45 SpecialNeeds Health_Sports NaN 0.666667
46 46 10 SpecialNeeds Music_Arts NaN 0.813725
47 47 48 SpecialNeeds Warmth Care_Hunger NaN 1.000000
48 48 7 Warmth Care_Hunger NaN 0.940618
In [29]:
category_1.drop(['level_0','index'],axis=1,inplace=True)
category_1.head(2)
Out[29]:
clean_categories labels labels
0 AppliedLearning 0.183074 0.816926
1 AppliedLearning Health_Sports 0.169399 0.830601
In [30]:
category_1 = df_column_uniquify(category_1)
In [31]:
category_1.head(2)
Out[31]:
clean_categories labels labels_1
0 AppliedLearning 0.183074 0.816926
1 AppliedLearning Health_Sports 0.169399 0.830601
In [32]:
category_1.rename(columns={'labels':'Category_0','labels_1':'Category_1'},inplace=True)
In [33]:
category_1["Category_0"].fillna( method ='ffill', inplace = True)
category_1["Category_1"].fillna( method ='ffill', inplace = True)
In [34]:
project_data_train = pd.merge(project_data_train, category_1, on='clean_categories', how='left').reset_index()
In [35]:
project_data_train.drop(['index','clean_categories'],axis=1, inplace=True)
In [36]:
project_data_train.head(1)
Out[36]:
id teacher_id teacher_prefix school_state project_submitted_datetime project_grade_category project_subject_subcategories project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary teacher_number_of_previously_posted_projects labels Category_0 Category_1
0 p097740 0d153ad81f03058ce80c0c3c697b77b5 Teacher CA 2017-03-31 00:34:13 Grades 6-8 Applied Sciences, Mathematics A Mindstorm is Brewing When I tell people what I teach, the response ... LEGO Mindstorms will give my students an exper... NaN NaN My students need access to meaningful applicat... 1 1 0.142446 0.816247

preprocessing of project_subject_categories - Test Data

In [37]:
catogories = list(project_data_test['project_subject_categories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
cat_list = []
for i in catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp+=j.strip()+" " #" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_') # we are replacing the & value into 
    cat_list.append(temp.strip())
    
project_data_test['clean_categories'] = cat_list
project_data_test.drop(['project_subject_categories'], axis=1, inplace=True)
In [38]:
unique_list_test = []
for x in cat_list:
    if x not in unique_list_test: 
            unique_list_test.append(x)
            
#https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other       

difference=list(set(unique_list_test).difference(unique_list))
print(difference)
['Music_Arts Warmth Care_Hunger']
In [39]:
df1=pd.DataFrame([['Music_Arts Warmth Care_Hunger',0.5,0.5]],columns=['clean_categories','Category_0','Category_1'])
In [40]:
category_1=category_1.append(df1, ignore_index = True) 
In [41]:
project_data_test = pd.merge(project_data_test, category_1, on='clean_categories', how='left').reset_index()
In [42]:
project_data_test.drop(['clean_categories','Unnamed: 0'],axis=1, inplace=True)

preprocessing of project_subject_subcategories - Train Data

In [43]:
sub_catogories = list(project_data_train['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    sub_cat_list.append(temp.strip())

project_data_train['clean_subcategories'] = sub_cat_list
project_data_train.drop(['project_subject_subcategories'], axis=1, inplace=True)
In [44]:
unique_list = []
for x in sub_cat_list:
    if x not in unique_list: 
            unique_list.append(x)
    
    
categories=pd.DataFrame({'clean_subcategories': unique_list})
categories=categories.sort_values(['clean_subcategories'], ascending=True).reset_index()
In [45]:
df1=project_data_train[['clean_subcategories','labels']][(project_data_train['labels']==1)]
In [46]:
df2=project_data_train[['clean_subcategories','labels']][(project_data_train['labels']==0)]
In [47]:
z =df1.groupby(['clean_subcategories'])['labels'].value_counts() /project_data_train.groupby(['clean_subcategories'])['labels'].count()
group_1=pd.DataFrame(z)
group_1=group_1.reset_index(drop=True)
In [48]:
z1 =df2.groupby(['clean_subcategories'])['labels'].value_counts() /project_data_train.groupby(['clean_subcategories'])['labels'].count()
group_0=pd.DataFrame(z1)
group_0=group_0.reset_index(drop=True)
In [49]:
x1= df1.groupby(['clean_subcategories'])['labels'].value_counts() 
class_1=pd.DataFrame(x1)
class_1=class_1.reset_index(drop=True)

x0= df2.groupby(['clean_subcategories'])['labels'].value_counts() 
class_0=pd.DataFrame(x0)
class_0=class_0.reset_index(drop=True)
In [50]:
Response_Table = pd.concat([categories, class_0, class_1],axis=1)
Response_Table = df_column_uniquify(Response_Table)
Response_Table.rename(columns={'labels':'Class=0','labels_1':'Class=1'},inplace=True)
print("Response Table for Sub-Categories")
Response_Table
Response Table for Sub-Categories
Out[50]:
index clean_subcategories Class=0 Class=1
0 10 AppliedSciences 137.0 603.0
1 116 AppliedSciences CharacterEducation 4.0 10.0
2 176 AppliedSciences Civics_Government 1.0 4.0
3 53 AppliedSciences College_CareerPrep 16.0 118.0
4 204 AppliedSciences CommunityService 1.0 6.0
5 142 AppliedSciences ESL 4.0 23.0
6 73 AppliedSciences EarlyDevelopment 12.0 47.0
7 331 AppliedSciences Economics 1.0 228.0
8 32 AppliedSciences EnvironmentalScience 57.0 42.0
9 91 AppliedSciences Extracurricular 3.0 5.0
10 360 AppliedSciences FinancialLiteracy 1.0 134.0
11 334 AppliedSciences ForeignLanguages 2.0 9.0
12 306 AppliedSciences Gym_Fitness 1.0 16.0
13 15 AppliedSciences Health_LifeScience 26.0 147.0
14 124 AppliedSciences Health_Wellness 1.0 110.0
15 218 AppliedSciences History_Geography 6.0 823.0
16 47 AppliedSciences Literacy 28.0 15.0
17 54 AppliedSciences Literature_Writing 17.0 30.0
18 0 AppliedSciences Mathematics 187.0 17.0
19 214 AppliedSciences Music 4.0 8.0
20 365 AppliedSciences NutritionEducation 1.0 14.0
21 144 AppliedSciences Other 5.0 99.0
22 43 AppliedSciences ParentInvolvement 4.0 3.0
23 83 AppliedSciences PerformingArts 1.0 174.0
24 169 AppliedSciences SocialSciences 1.0 1.0
25 94 AppliedSciences SpecialNeeds 15.0 87.0
26 295 AppliedSciences TeamSports 39.0 2.0
27 34 AppliedSciences VisualArts 21.0 21.0
28 229 AppliedSciences Warmth Care_Hunger 4.0 16.0
29 46 CharacterEducation 2.0 4.0
... ... ... ... ...
339 267 NutritionEducation TeamSports NaN 9.0
340 147 NutritionEducation VisualArts NaN 120.0
341 41 Other NaN 2.0
342 250 Other ParentInvolvement NaN 5.0
343 177 Other PerformingArts NaN 3.0
344 326 Other SocialSciences NaN 19.0
345 85 Other SpecialNeeds NaN 48.0
346 271 Other TeamSports NaN 10.0
347 215 Other VisualArts NaN 19.0
348 301 ParentInvolvement NaN 1037.0
349 273 ParentInvolvement PerformingArts NaN 6.0
350 81 ParentInvolvement SocialSciences NaN 83.0
351 275 ParentInvolvement SpecialNeeds NaN 5.0
352 105 ParentInvolvement VisualArts NaN 253.0
353 90 PerformingArts NaN 546.0
354 342 PerformingArts SocialSciences NaN 396.0
355 166 PerformingArts SpecialNeeds NaN NaN
356 200 PerformingArts TeamSports NaN NaN
357 117 PerformingArts VisualArts NaN NaN
358 69 SocialSciences NaN NaN
359 246 SocialSciences SpecialNeeds NaN NaN
360 24 SocialSciences VisualArts NaN NaN
361 3 SpecialNeeds NaN NaN
362 304 SpecialNeeds TeamSports NaN NaN
363 18 SpecialNeeds VisualArts NaN NaN
364 343 SpecialNeeds Warmth Care_Hunger NaN NaN
365 33 TeamSports NaN NaN
366 327 TeamSports VisualArts NaN NaN
367 26 VisualArts NaN NaN
368 13 Warmth Care_Hunger NaN NaN

369 rows × 4 columns

In [51]:
category_1 = pd.concat([categories,group_0,group_1],axis=1).reset_index()
category_1.head(2)
Out[51]:
level_0 index clean_subcategories labels labels
0 0 10 AppliedSciences 0.185135 0.814865
1 1 116 AppliedSciences CharacterEducation 0.285714 0.714286
In [52]:
category_1.drop(['index'],axis=1,inplace=True)
category_1.head(2)
Out[52]:
level_0 clean_subcategories labels labels
0 0 AppliedSciences 0.185135 0.814865
1 1 AppliedSciences CharacterEducation 0.285714 0.714286
In [53]:
category_1 = df_column_uniquify(category_1)
category_1.head(2)
category_1.rename(columns={'labels':'SubCategory_0','labels_1':'SubCategory_1'},inplace=True)
category_1.head(2)
Out[53]:
level_0 clean_subcategories SubCategory_0 SubCategory_1
0 0 AppliedSciences 0.185135 0.814865
1 1 AppliedSciences CharacterEducation 0.285714 0.714286
In [54]:
category_1["SubCategory_0"].fillna( method ='ffill', inplace = True)
category_1["SubCategory_1"].fillna( method ='ffill', inplace = True)
In [55]:
project_data_train = pd.merge(project_data_train, category_1, on='clean_subcategories', how='left').reset_index()
In [56]:
project_data_train.drop(['clean_subcategories'],axis=1, inplace=True)

preprocessing of project_subject_subcategories - Test Data

In [57]:
sub_catogories = list(project_data_test['project_subject_subcategories'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

sub_cat_list = []
for i in sub_catogories:
    temp = ""
    # consider we have text like this "Math & Science, Warmth, Care & Hunger"
    for j in i.split(','): # it will split it in three parts ["Math & Science", "Warmth", "Care & Hunger"]
        if 'The' in j.split(): # this will split each of the catogory based on space "Math & Science"=> "Math","&", "Science"
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    sub_cat_list.append(temp.strip())

project_data_test['clean_subcategories'] = sub_cat_list
project_data_test.drop(['project_subject_subcategories'], axis=1, inplace=True)
In [58]:
unique_list_test = []
for x in sub_cat_list:
    if x not in unique_list_test: 
            unique_list_test.append(x)
            
#https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other       

difference=list(set(unique_list_test).difference(unique_list))
print(difference)
['Other Warmth Care_Hunger', 'Civics_Government Extracurricular', 'CommunityService NutritionEducation', 'Gym_Fitness SocialSciences', 'ParentInvolvement Warmth Care_Hunger', 'CommunityService Economics', 'Mathematics Warmth Care_Hunger', 'FinancialLiteracy VisualArts', 'ESL Gym_Fitness', 'ForeignLanguages PerformingArts', 'College_CareerPrep Warmth Care_Hunger', 'VisualArts Warmth Care_Hunger', 'CharacterEducation Warmth Care_Hunger', 'CommunityService EarlyDevelopment', 'ParentInvolvement TeamSports']
In [59]:
df1=pd.DataFrame([['Other Warmth Care_Hunger',0.5,0.5],['Civics_Government Extracurricular',0.5,0.5],['CommunityService NutritionEducation',0.5,0.5],['Gym_Fitness SocialSciences',0.5,0.5],['ParentInvolvement Warmth Care_Hunger',0.5,0.5],['CommunityService Economics',0.5,0.5],['Mathematics Warmth Care_Hunger',0.5,0.5],['FinancialLiteracy VisualArts',0.5,0.5],['ESL Gym_Fitness',0.5,0.5],['ForeignLanguages PerformingArts',0.5,0.5],['College_CareerPrep Warmth Care_Hunger',0.5,0.5],['VisualArts Warmth Care_Hunger',0.5,0.5],['CharacterEducation Warmth Care_Hunger',0.5,0.5],['CommunityService EarlyDevelopment',0.5,0.5],['ParentInvolvement TeamSports',0.5,0.5]],columns=['clean_subcategories','SubCategory_0','SubCategory_1'])
In [60]:
category_1=category_1.append(df1, ignore_index = True) 
In [61]:
category_1.drop(['level_0'],axis=1,inplace=True)
category_1.head(1)
Out[61]:
SubCategory_0 SubCategory_1 clean_subcategories
0 0.185135 0.814865 AppliedSciences
In [62]:
project_data_test = pd.merge(project_data_test, category_1, on='clean_subcategories', how='left')

Text preprocessing - Train Data

In [63]:
# merge two column text dataframe: 
project_data_train["essay"] = project_data_train["project_essay_1"].map(str) +\
                              project_data_train["project_essay_2"].map(str) + \
                              project_data_train["project_essay_3"].map(str) + \
                              project_data_train["project_essay_4"].map(str)
In [64]:
# printing some random reviews
print(project_data_train['essay'].values[0])
print("="*50)
When I tell people what I teach, the response is often the same, \"Wow, you must be so patient\" or \"You are so nice for doing that\". I do not agree with them and I've never truly understood their odd, pitying responses. I believe it is a common response because people view my special education students as being so different. \r\n\r\nMy students are completely misunderstood. Yes, they can get loud, are very active, and fairly low academically; but they are also students who just want to learn.\r\n\r\nWhile my students are given access to the same resources and supplies the general education population receives, my students often need more. They all learn in a myriad of ways, and therefore need resources and supplies that match. I'm always looking for new, innovative ways to teach and engage my students but they are generally out of my school's budget. \r\n\r\nMy students will spend their entire academic careers overcoming hundreds of obstacles, because of who they are. My hope is that, in my classroom, money will not be an obstacle my students have to overcome.LEGO Mindstorms will give my students an experiential learning platform for hands-on application of their scientific and mathematical skills that matches the rigor of Common Core and Next Generation Science Standards. Specific skills they will develop are scientific problem-solving, computer coding, engineering design, and peer-to-peer communication. Since LEGO Mindstorms is a versatile resource, as their skills develop so will the number of ways they are able to engage with it making this a long-term learning tool. \r\n\r\nThis will improve their school lives by giving them the opportunity to create something they can be excited about, something they've never been told they could do before. \r\n\r\nI know with the LEGO Mindstorms, my student's will be engaged and excited to dive into the world of STEM and all that it has to offer them. This resource can challenge them and increase the rigor in the classroom, something they have not experienced before.nannan
==================================================
In [65]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
In [66]:
sent = decontracted(project_data_train['essay'].values[0])
print(sent)
print("="*50)
When I tell people what I teach, the response is often the same, \"Wow, you must be so patient\" or \"You are so nice for doing that\". I do not agree with them and I have never truly understood their odd, pitying responses. I believe it is a common response because people view my special education students as being so different. \r\n\r\nMy students are completely misunderstood. Yes, they can get loud, are very active, and fairly low academically; but they are also students who just want to learn.\r\n\r\nWhile my students are given access to the same resources and supplies the general education population receives, my students often need more. They all learn in a myriad of ways, and therefore need resources and supplies that match. I am always looking for new, innovative ways to teach and engage my students but they are generally out of my school is budget. \r\n\r\nMy students will spend their entire academic careers overcoming hundreds of obstacles, because of who they are. My hope is that, in my classroom, money will not be an obstacle my students have to overcome.LEGO Mindstorms will give my students an experiential learning platform for hands-on application of their scientific and mathematical skills that matches the rigor of Common Core and Next Generation Science Standards. Specific skills they will develop are scientific problem-solving, computer coding, engineering design, and peer-to-peer communication. Since LEGO Mindstorms is a versatile resource, as their skills develop so will the number of ways they are able to engage with it making this a long-term learning tool. \r\n\r\nThis will improve their school lives by giving them the opportunity to create something they can be excited about, something they have never been told they could do before. \r\n\r\nI know with the LEGO Mindstorms, my student is will be engaged and excited to dive into the world of STEM and all that it has to offer them. This resource can challenge them and increase the rigor in the classroom, something they have not experienced before.nannan
==================================================
In [67]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)
When I tell people what I teach, the response is often the same,  Wow, you must be so patient  or  You are so nice for doing that . I do not agree with them and I have never truly understood their odd, pitying responses. I believe it is a common response because people view my special education students as being so different.     My students are completely misunderstood. Yes, they can get loud, are very active, and fairly low academically; but they are also students who just want to learn.    While my students are given access to the same resources and supplies the general education population receives, my students often need more. They all learn in a myriad of ways, and therefore need resources and supplies that match. I am always looking for new, innovative ways to teach and engage my students but they are generally out of my school is budget.     My students will spend their entire academic careers overcoming hundreds of obstacles, because of who they are. My hope is that, in my classroom, money will not be an obstacle my students have to overcome.LEGO Mindstorms will give my students an experiential learning platform for hands-on application of their scientific and mathematical skills that matches the rigor of Common Core and Next Generation Science Standards. Specific skills they will develop are scientific problem-solving, computer coding, engineering design, and peer-to-peer communication. Since LEGO Mindstorms is a versatile resource, as their skills develop so will the number of ways they are able to engage with it making this a long-term learning tool.     This will improve their school lives by giving them the opportunity to create something they can be excited about, something they have never been told they could do before.     I know with the LEGO Mindstorms, my student is will be engaged and excited to dive into the world of STEM and all that it has to offer them. This resource can challenge them and increase the rigor in the classroom, something they have not experienced before.nannan
In [68]:
#remove spacial character: https://stackoverflow.com/a/5843547/4084039
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
print(sent)
When I tell people what I teach the response is often the same Wow you must be so patient or You are so nice for doing that I do not agree with them and I have never truly understood their odd pitying responses I believe it is a common response because people view my special education students as being so different My students are completely misunderstood Yes they can get loud are very active and fairly low academically but they are also students who just want to learn While my students are given access to the same resources and supplies the general education population receives my students often need more They all learn in a myriad of ways and therefore need resources and supplies that match I am always looking for new innovative ways to teach and engage my students but they are generally out of my school is budget My students will spend their entire academic careers overcoming hundreds of obstacles because of who they are My hope is that in my classroom money will not be an obstacle my students have to overcome LEGO Mindstorms will give my students an experiential learning platform for hands on application of their scientific and mathematical skills that matches the rigor of Common Core and Next Generation Science Standards Specific skills they will develop are scientific problem solving computer coding engineering design and peer to peer communication Since LEGO Mindstorms is a versatile resource as their skills develop so will the number of ways they are able to engage with it making this a long term learning tool This will improve their school lives by giving them the opportunity to create something they can be excited about something they have never been told they could do before I know with the LEGO Mindstorms my student is will be engaged and excited to dive into the world of STEM and all that it has to offer them This resource can challenge them and increase the rigor in the classroom something they have not experienced before nannan
In [69]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]
In [70]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_essays = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data_train['essay'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
    preprocessed_essays.append(sent.lower().strip())
    
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
my_counter = Counter()
for word in preprocessed_essays:
    my_counter.update(word.split())
    
essay_dict = dict(my_counter)
sorted_essays_dict = dict(sorted(essay_dict.items(), key=lambda kv: kv[1]))
100%|██████████████████████████████████████████████████████████████████████████| 33500/33500 [00:17<00:00, 1942.43it/s]
In [71]:
preprocessed_essays[0]
Out[71]:
'tell people teach response often wow must patient nice not agree never truly understood odd pitying responses believe common response people view special education students different students completely misunderstood yes get loud active fairly low academically also students want learn students given access resources supplies general education population receives students often need learn myriad ways therefore need resources supplies match always looking new innovative ways teach engage students generally school budget students spend entire academic careers overcoming hundreds obstacles hope classroom money not obstacle students overcome lego mindstorms give students experiential learning platform hands application scientific mathematical skills matches rigor common core next generation science standards specific skills develop scientific problem solving computer coding engineering design peer peer communication since lego mindstorms versatile resource skills develop number ways able engage making long term learning tool improve school lives giving opportunity create something excited something never told could know lego mindstorms student engaged excited dive world stem offer resource challenge increase rigor classroom something not experienced nannan'

Text Preprocessing - For Test Data

In [72]:
# merge two column text dataframe: 
project_data_test["essay"] = project_data_test["project_essay_1"].map(str) +\
                             project_data_test["project_essay_2"].map(str) + \
                             project_data_test["project_essay_3"].map(str) + \
                             project_data_test["project_essay_4"].map(str)
In [73]:
# printing some random essays.
print(project_data_test['essay'].values[0])
My students are bright and adorable busy bees! Our school is located in South Florida and is very diverse. Some of my students speak English, Spanish and Creole. All together our students have a broad range of academics and as a magnet school our students make deeper learning experiences in fields like Science,Technology, Engineering, and Mathematics. There are many languages, backgrounds and life experiences represented in my class. \r\n\r\nMy class is a fun and energetic group of first graders.\r\nThey love to dance and are eager to learn. We have been busy practicing collaborating, solving problems and working independently. In my class, we are constantly singing and dancing, doing reading and writing projects, and playing academic games. My goal for my students is to inspire them to become life-long readers that have a passion and love for learning using technology and differentiated instruction. \r\nI want my students to have the opportunity to utilize technology to practice their reading/math skills on a daily basis which will increase learning and the quality of their education. Majority of my students and the schools population have very limited access to technology and the internet. For most of my students, the only  chance they get to use a computer or tablet is at school. If that school or classroom (LIKE MINE) does not have computers than their chances are even smaller. Knowing that my students are eager and capable of achieving mastery level academic grades and awards, I want provide them with the resources and tools to a bright future. \r\nWith these Amazon Fire Tablets  in my classroom, students can have access to the technology every day, which would improve their problem solving skills as well as their literacy development. Currently, there are over 1.2 million apps available at the Google Play app store. I feel having access to such an abundant source of information where apps can be downloaded to practice various skills will help close the gap for some of those students and enable them to see and/or learn about things they thought they'd never have a chance to experience. These tablets would allow my students time to work independently each day on science research, math practice and challenges, art projects, reading and literacy, which would allow me time to work with smaller groups of students.\r\n\r\nAny donations made would improve my classroom tremendously!\r\nMany of my students have little or no access to technology, except when they are at school. And even when they are at school, it is limited. Getting these Amazon Fire Tablets would allow my students to become life long learners and Soar in to a bright future!nannan
In [74]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
In [75]:
sent_test = decontracted(project_data_test['essay'].values[0])
print(sent_test)
My students are bright and adorable busy bees! Our school is located in South Florida and is very diverse. Some of my students speak English, Spanish and Creole. All together our students have a broad range of academics and as a magnet school our students make deeper learning experiences in fields like Science,Technology, Engineering, and Mathematics. There are many languages, backgrounds and life experiences represented in my class. \r\n\r\nMy class is a fun and energetic group of first graders.\r\nThey love to dance and are eager to learn. We have been busy practicing collaborating, solving problems and working independently. In my class, we are constantly singing and dancing, doing reading and writing projects, and playing academic games. My goal for my students is to inspire them to become life-long readers that have a passion and love for learning using technology and differentiated instruction. \r\nI want my students to have the opportunity to utilize technology to practice their reading/math skills on a daily basis which will increase learning and the quality of their education. Majority of my students and the schools population have very limited access to technology and the internet. For most of my students, the only  chance they get to use a computer or tablet is at school. If that school or classroom (LIKE MINE) does not have computers than their chances are even smaller. Knowing that my students are eager and capable of achieving mastery level academic grades and awards, I want provide them with the resources and tools to a bright future. \r\nWith these Amazon Fire Tablets  in my classroom, students can have access to the technology every day, which would improve their problem solving skills as well as their literacy development. Currently, there are over 1.2 million apps available at the Google Play app store. I feel having access to such an abundant source of information where apps can be downloaded to practice various skills will help close the gap for some of those students and enable them to see and/or learn about things they thought they would never have a chance to experience. These tablets would allow my students time to work independently each day on science research, math practice and challenges, art projects, reading and literacy, which would allow me time to work with smaller groups of students.\r\n\r\nAny donations made would improve my classroom tremendously!\r\nMany of my students have little or no access to technology, except when they are at school. And even when they are at school, it is limited. Getting these Amazon Fire Tablets would allow my students to become life long learners and Soar in to a bright future!nannan
In [76]:
# \r \n \t remove from string python: http://texthandler.com/info/remove-line-breaks-python/
sent_test = sent_test.replace('\\r', ' ')
sent_test = sent_test.replace('\\"', ' ')
sent_test = sent_test.replace('\\n', ' ')
print(sent_test)
My students are bright and adorable busy bees! Our school is located in South Florida and is very diverse. Some of my students speak English, Spanish and Creole. All together our students have a broad range of academics and as a magnet school our students make deeper learning experiences in fields like Science,Technology, Engineering, and Mathematics. There are many languages, backgrounds and life experiences represented in my class.     My class is a fun and energetic group of first graders.  They love to dance and are eager to learn. We have been busy practicing collaborating, solving problems and working independently. In my class, we are constantly singing and dancing, doing reading and writing projects, and playing academic games. My goal for my students is to inspire them to become life-long readers that have a passion and love for learning using technology and differentiated instruction.   I want my students to have the opportunity to utilize technology to practice their reading/math skills on a daily basis which will increase learning and the quality of their education. Majority of my students and the schools population have very limited access to technology and the internet. For most of my students, the only  chance they get to use a computer or tablet is at school. If that school or classroom (LIKE MINE) does not have computers than their chances are even smaller. Knowing that my students are eager and capable of achieving mastery level academic grades and awards, I want provide them with the resources and tools to a bright future.   With these Amazon Fire Tablets  in my classroom, students can have access to the technology every day, which would improve their problem solving skills as well as their literacy development. Currently, there are over 1.2 million apps available at the Google Play app store. I feel having access to such an abundant source of information where apps can be downloaded to practice various skills will help close the gap for some of those students and enable them to see and/or learn about things they thought they would never have a chance to experience. These tablets would allow my students time to work independently each day on science research, math practice and challenges, art projects, reading and literacy, which would allow me time to work with smaller groups of students.    Any donations made would improve my classroom tremendously!  Many of my students have little or no access to technology, except when they are at school. And even when they are at school, it is limited. Getting these Amazon Fire Tablets would allow my students to become life long learners and Soar in to a bright future!nannan
In [77]:
#remove spacial character: https://stackoverflow.com/a/5843547/4084039
sent_test = re.sub('[^A-Za-z0-9]+', ' ', sent_test)
print(sent_test)
My students are bright and adorable busy bees Our school is located in South Florida and is very diverse Some of my students speak English Spanish and Creole All together our students have a broad range of academics and as a magnet school our students make deeper learning experiences in fields like Science Technology Engineering and Mathematics There are many languages backgrounds and life experiences represented in my class My class is a fun and energetic group of first graders They love to dance and are eager to learn We have been busy practicing collaborating solving problems and working independently In my class we are constantly singing and dancing doing reading and writing projects and playing academic games My goal for my students is to inspire them to become life long readers that have a passion and love for learning using technology and differentiated instruction I want my students to have the opportunity to utilize technology to practice their reading math skills on a daily basis which will increase learning and the quality of their education Majority of my students and the schools population have very limited access to technology and the internet For most of my students the only chance they get to use a computer or tablet is at school If that school or classroom LIKE MINE does not have computers than their chances are even smaller Knowing that my students are eager and capable of achieving mastery level academic grades and awards I want provide them with the resources and tools to a bright future With these Amazon Fire Tablets in my classroom students can have access to the technology every day which would improve their problem solving skills as well as their literacy development Currently there are over 1 2 million apps available at the Google Play app store I feel having access to such an abundant source of information where apps can be downloaded to practice various skills will help close the gap for some of those students and enable them to see and or learn about things they thought they would never have a chance to experience These tablets would allow my students time to work independently each day on science research math practice and challenges art projects reading and literacy which would allow me time to work with smaller groups of students Any donations made would improve my classroom tremendously Many of my students have little or no access to technology except when they are at school And even when they are at school it is limited Getting these Amazon Fire Tablets would allow my students to become life long learners and Soar in to a bright future nannan
In [78]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]
In [79]:
# Combining all the above statemennts 
from tqdm import tqdm
preprocessed_essays_test = []
# tqdm is for printing the status bar
for sentence in tqdm(project_data_test['essay'].values):
    sent_cv = decontracted(sentence)
    sent_cv = sent_cv.replace('\\r', ' ')
    sent_cv = sent_cv.replace('\\"', ' ')
    sent_cv = sent_cv.replace('\\n', ' ')
    sent_cv = re.sub('[^A-Za-z0-9]+', ' ', sent_cv)
    # https://gist.github.com/sebleier/554280
    sent_cv = ' '.join(e for e in sent_cv.split() if e not in stopwords)
    preprocessed_essays_test.append(sent_cv.lower().strip())
100%|██████████████████████████████████████████████████████████████████████████| 16500/16500 [00:08<00:00, 1913.49it/s]
In [80]:
# after preprocesing
preprocessed_essays_test[0]
Out[80]:
'my students bright adorable busy bees our school located south florida diverse some students speak english spanish creole all together students broad range academics magnet school students make deeper learning experiences fields like science technology engineering mathematics there many languages backgrounds life experiences represented class my class fun energetic group first graders they love dance eager learn we busy practicing collaborating solving problems working independently in class constantly singing dancing reading writing projects playing academic games my goal students inspire become life long readers passion love learning using technology differentiated instruction i want students opportunity utilize technology practice reading math skills daily basis increase learning quality education majority students schools population limited access technology internet for students chance get use computer tablet school if school classroom like mine not computers chances even smaller knowing students eager capable achieving mastery level academic grades awards i want provide resources tools bright future with amazon fire tablets classroom students access technology every day would improve problem solving skills well literacy development currently 1 2 million apps available google play app store i feel access abundant source information apps downloaded practice various skills help close gap students enable see learn things thought would never chance experience these tablets would allow students time work independently day science research math practice challenges art projects reading literacy would allow time work smaller groups students any donations made would improve classroom tremendously many students little no access technology except school and even school limited getting amazon fire tablets would allow students become life long learners soar bright future nannan'

Preprocessing of `project_title` - Train Data

In [81]:
# printing some random title.
print(project_data_train['project_title'].values[0])
print("="*50)
A Mindstorm is Brewing
==================================================
In [82]:
sent = decontracted(project_data_train['project_title'].values[0])
print(sent)
print("="*50)
A Mindstorm is Brewing
==================================================
In [83]:
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\"', ' ')
sent = sent.replace('\\n', ' ')
print(sent)
A Mindstorm is Brewing
In [84]:
# Combining all the above statemennts 
from tqdm import tqdm
preprocessed_title = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data_train['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e not in stopwords)
    preprocessed_title.append(sent.lower().strip())
    
my_counter = Counter()
for word in preprocessed_title:
    my_counter.update(word.split())
    
title_dict = dict(my_counter)
sorted_title_dict = dict(sorted(title_dict.items(), key=lambda kv: kv[1]))
100%|█████████████████████████████████████████████████████████████████████████| 33500/33500 [00:00<00:00, 38229.32it/s]
In [85]:
# after preprocesing
preprocessed_title[0]
Out[85]:
'a mindstorm brewing'

Preprocessing of `project_title - For Test Data`

In [86]:
# printing some random title.
print(project_data_test['project_title'].values[0])
print("="*50)
Rich at Heart 1st Graders Need Tablets for Reading/STEM!
==================================================
In [87]:
# Combining all the above statemennts 
from tqdm import tqdm
preprocessed_title_test = []
# tqdm is for printing the status bar
for sentance in tqdm(project_data_test['project_title'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
    # https://gist.github.com/sebleier/554280
    sent = ' '.join(e for e in sent.split() if e not in stopwords)
    preprocessed_title_test.append(sent.lower().strip())
100%|█████████████████████████████████████████████████████████████████████████| 16500/16500 [00:00<00:00, 37491.34it/s]
In [88]:
# after preprocesing
preprocessed_title_test[0]
Out[88]:
'rich heart 1st graders need tablets reading stem'

1.5 Preparing data for models

In [89]:
project_data_train.columns
Out[89]:
Index(['index', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category', 'project_title',
       'project_essay_1', 'project_essay_2', 'project_essay_3',
       'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'labels', 'Category_0',
       'Category_1', 'level_0', 'SubCategory_0', 'SubCategory_1', 'essay'],
      dtype='object')
In [90]:
project_data_test.columns
Out[90]:
Index(['index', 'id', 'teacher_id', 'teacher_prefix', 'school_state',
       'project_submitted_datetime', 'project_grade_category', 'project_title',
       'project_essay_1', 'project_essay_2', 'project_essay_3',
       'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'Category_0',
       'Category_1', 'clean_subcategories', 'SubCategory_0', 'SubCategory_1',
       'essay'],
      dtype='object')

Encoding for State - Train Data

In [91]:
state = list(project_data_train['school_state'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

state_list = []
for i in state:
    temp = ""
    
    for j in i.split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    state_list.append(temp.strip())
    
project_data_train['clean_state'] = state_list
project_data_train.drop(['school_state'], axis=1, inplace=True)
In [92]:
unique_list = []
for x in state_list:
    if x not in unique_list: 
            unique_list.append(x)
    
    
categories=pd.DataFrame({'clean_state': unique_list})
categories=categories.sort_values(['clean_state'], ascending=True).reset_index()

df1=project_data_train[['clean_state','labels']][(project_data_train['labels']==1)]

df2=project_data_train[['clean_state','labels']][(project_data_train['labels']==0)]
In [93]:
z =df1.groupby(['clean_state'])['labels'].value_counts() /project_data_train.groupby(['clean_state'])['labels'].count()

group_1=pd.DataFrame(z)

group_1=group_1.reset_index(drop=True)
print(group_1.head(2))
     labels
0  0.868687
1  0.848030
In [94]:
z1 =df2.groupby(['clean_state'])['labels'].value_counts() /project_data_train.groupby(['clean_state'])['labels'].count()

group_0=pd.DataFrame(z1)

group_0=group_0.reset_index(drop=True)
print(group_0.head(2))
     labels
0  0.131313
1  0.151970
In [95]:
x1= df1.groupby(['clean_state'])['labels'].value_counts() 
class_1=pd.DataFrame(x1)
class_1=class_1.reset_index(drop=True)
print ( class_1.head(2))
   labels
0      86
1     452
In [96]:
x0= df2.groupby(['clean_state'])['labels'].value_counts() 
class_0=pd.DataFrame(x0)
class_0=class_0.reset_index(drop=True)
print ( class_0.head(2))
   labels
0      13
1      81
In [97]:
Response_Table = pd.concat([categories, class_0, class_1],axis=1)
Response_Table = df_column_uniquify(Response_Table)
Response_Table.rename(columns={'labels':'Class=0','labels_1':'Class=1'},inplace=True)

print("Response Table for State")
Response_Table
Response Table for State
Out[97]:
index clean_state Class=0 Class=1
0 44 AK 13 86
1 2 AL 81 452
2 29 AR 61 247
3 34 AZ 103 555
4 0 CA 687 4050
5 19 CO 59 307
6 39 CT 65 466
7 42 DC 35 138
8 32 DE 15 89
9 4 FL 336 1585
10 18 GA 194 1040
11 20 HI 21 130
12 27 IA 32 178
13 35 ID 41 153
14 31 IL 200 1118
15 17 IN 129 657
16 45 KS 23 143
17 28 KY 47 353
18 13 LA 138 591
19 1 MA 108 633
20 14 MD 71 386
21 26 ME 22 116
22 38 MI 153 830
23 25 MN 55 330
24 21 MO 116 683
25 43 MS 67 347
26 47 MT 15 48
27 10 NC 213 1358
28 49 ND 2 33
29 37 NE 17 79
30 36 NH 12 87
31 24 NJ 127 547
32 41 NM 19 148
33 33 NV 61 367
34 12 NY 307 1952
35 15 OH 88 668
36 6 OK 114 591
37 46 OR 72 303
38 16 PA 130 791
39 23 RI 11 68
40 30 SC 186 1062
41 40 SD 12 80
42 22 TN 81 449
43 9 TX 456 1794
44 3 UT 85 437
45 8 VA 88 493
46 50 VT 5 18
47 7 WA 89 662
48 5 WI 79 476
49 11 WV 21 126
50 48 WY 6 32
In [98]:
category_1 = pd.concat([categories,group_0,group_1],axis=1).reset_index()
category_1.head(2)
Out[98]:
level_0 index clean_state labels labels
0 0 44 AK 0.131313 0.868687
1 1 2 AL 0.151970 0.848030
In [99]:
category_1.drop(['level_0','index'],axis=1,inplace=True)
print("Response Table For Categories")
category_1.head(2)
Response Table For Categories
Out[99]:
clean_state labels labels
0 AK 0.131313 0.868687
1 AL 0.151970 0.848030
In [100]:
category_1 = df_column_uniquify(category_1)
category_1.head(2)
Out[100]:
clean_state labels labels_1
0 AK 0.131313 0.868687
1 AL 0.151970 0.848030
In [101]:
category_1.rename(columns={'labels':'State_0','labels_1':'State_1'},inplace=True)
category_1.head(2)
Out[101]:
clean_state State_0 State_1
0 AK 0.131313 0.868687
1 AL 0.151970 0.848030
In [102]:
category_1["State_0"].fillna( method ='ffill', inplace = True)
category_1["State_1"].fillna( method ='ffill', inplace = True)
In [103]:
project_data_train = pd.merge(project_data_train, category_1, on='clean_state', how='left')
project_data_train.drop(['clean_state'],axis=1, inplace=True)
In [104]:
Cat_0 = list(project_data_train['State_0'].values)
Category_Class_0 = []
for i in Cat_0:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("NaN",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    Category_Class_0.append(temp.strip())
    
project_data_train['State_Class_0'] = Category_Class_0
project_data_train.drop(['State_0'], axis=1, inplace=True)
In [105]:
Cat_1 = list(project_data_train['State_1'].values)
Category_Class_1 = []
for i in Cat_1:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("NaN",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    Category_Class_1.append(temp.strip())
    
project_data_train['State_Class_1'] = Category_Class_1
project_data_train.drop(['State_1'], axis=1, inplace=True)

Encoding for State- Test Data

In [106]:
state = list(project_data_test['school_state'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

state_list_test = []
for i in state:
    temp = ""
    
    for j in i.split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) ex:"Math & Science"=>"Math&Science"
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    state_list_test.append(temp.strip())
In [107]:
project_data_test['clean_state'] = state_list_test
project_data_test.drop(['school_state'], axis=1, inplace=True)
In [108]:
unique_list_test = []
for x in state_list_test:
    if x not in unique_list_test: 
            unique_list_test.append(x)
            
#https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other       

difference=list(set(unique_list_test).difference(unique_list))
print(difference)
[]
In [109]:
project_data_test = pd.merge(project_data_test, category_1, on='clean_state', how='left')
project_data_test.drop(['clean_state'],axis=1, inplace=True)
In [110]:
State_0 = list(project_data_test['State_0'].values)

State_Class_0 = []
for i in State_0:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    State_Class_0.append(temp.strip())
    
project_data_test['State_Class_0'] = State_Class_0
project_data_test.drop(['State_0'], axis=1, inplace=True)
In [111]:
State_1 = list(project_data_test['State_1'].values)

State_Class_1 = []
for i in State_1:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    State_Class_1.append(temp.strip())
    
project_data_test['State_Class_1'] = State_Class_1
project_data_test.drop(['State_1'], axis=1, inplace=True)

Hot Encoding Project Grade Category - Train Data

In [112]:
grade = list(project_data_train['project_grade_category'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

grade_list = []
for i in grade:
    temp = ""
    
    for j in i.split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    grade_list.append(temp.strip())
    
project_data_train['clean_grade'] = grade_list
project_data_train.drop(['project_grade_category'], axis=1, inplace=True)
In [113]:
unique_list = []
for x in grade_list:
    if x not in unique_list: 
            unique_list.append(x)
    
    
categories=pd.DataFrame({'clean_grade': unique_list})
categories=categories.sort_values(['clean_grade'], ascending=True).reset_index()
In [114]:
df1=project_data_train[['clean_grade','labels']][(project_data_train['labels']==1)]
print(df1.head(2))
  clean_grade  labels
0   Grades6-8       1
1   Grades3-5       1
In [115]:
df2=project_data_train[['clean_grade','labels']][(project_data_train['labels']==0)]
In [116]:
z =df1.groupby(['clean_grade'])['labels'].value_counts() /project_data_train.groupby(['clean_grade'])['labels'].count()

group_1=pd.DataFrame(z)

group_1=group_1.reset_index(drop=True)
print(group_1.head(2))
     labels
0  0.853282
1  0.836197
In [117]:
z1 =df2.groupby(['clean_grade'])['labels'].value_counts() /project_data_train.groupby(['clean_grade'])['labels'].count()

group_0=pd.DataFrame(z1)

group_0=group_0.reset_index(drop=True)
print(group_0.head(2))
     labels
0  0.146718
1  0.163803
In [118]:
x1= df1.groupby(['clean_grade'])['labels'].value_counts() 
class_1=pd.DataFrame(x1)
class_1=class_1.reset_index(drop=True)
print ( class_1.head(2))
   labels
0    9724
1    4380
In [119]:
x0= df2.groupby(['clean_grade'])['labels'].value_counts() 
class_0=pd.DataFrame(x0)
class_0=class_0.reset_index(drop=True)
print ( class_0.head(2))
   labels
0    1672
1     858
In [120]:
Response_Table = pd.concat([categories, class_0, class_1],axis=1)
Response_Table = df_column_uniquify(Response_Table)
Response_Table.rename(columns={'labels':'Class=0','labels_1':'Class=1'},inplace=True)
Response_Table
Out[120]:
index clean_grade Class=0 Class=1
0 1 Grades3-5 1672 9724
1 0 Grades6-8 858 4380
2 3 Grades9-12 543 2744
3 2 GradesPreK-2 2095 11484
In [121]:
category_1 = pd.concat([categories,group_0,group_1],axis=1).reset_index()
category_1.head(2)
Out[121]:
level_0 index clean_grade labels labels
0 0 1 Grades3-5 0.146718 0.853282
1 1 0 Grades6-8 0.163803 0.836197
In [122]:
category_1.drop(['level_0','index'],axis=1,inplace=True)
print("Response Table For Categories")
category_1.head(2)
Response Table For Categories
Out[122]:
clean_grade labels labels
0 Grades3-5 0.146718 0.853282
1 Grades6-8 0.163803 0.836197
In [123]:
category_1 = df_column_uniquify(category_1)

category_1.head(2)

category_1.rename(columns={'labels':'Grade_0','labels_1':'Grade_1'},inplace=True)
category_1.head(2)
Out[123]:
clean_grade Grade_0 Grade_1
0 Grades3-5 0.146718 0.853282
1 Grades6-8 0.163803 0.836197
In [124]:
category_1["Grade_0"].fillna( method ='ffill', inplace = True)
category_1["Grade_1"].fillna( method ='ffill', inplace = True)
In [125]:
project_data_train = pd.merge(project_data_train, category_1, on='clean_grade', how='left')
project_data_train.drop(['clean_grade'],axis=1, inplace=True)

Hot Encoding Project Grade Category - Test Data

In [126]:
grade = list(project_data_test['project_grade_category'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

grade_list_test = []
for i in grade:
    temp = ""
    
    for j in i.split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("NaN",'')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    grade_list_test.append(temp.strip())
In [127]:
project_data_test['clean_grade'] = grade_list_test
project_data_test.drop(['project_grade_category'], axis=1, inplace=True)
In [128]:
unique_list_test = []
for x in grade_list_test:
    if x not in grade_list_test: 
            unique_list_test.append(x)
            
#https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other       

difference=list(set(unique_list_test).difference(unique_list))
print(difference)
[]
In [129]:
project_data_test = pd.merge(project_data_test, category_1, on='clean_grade', how='left')
In [130]:
project_data_test.drop(['clean_grade'],axis=1, inplace=True)
In [131]:
State_0 = list(project_data_test['Grade_0'].values)

State_Class_0 = []
for i in State_0:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    State_Class_0.append(temp.strip())
    
project_data_test['Grade_Class_0'] = State_Class_0
project_data_test.drop(['Grade_0'], axis=1, inplace=True)
In [132]:
State_1 = list(project_data_test['Grade_1'].values)

State_Class_1 = []
for i in State_1:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    State_Class_1.append(temp.strip())
    
project_data_test['Grade_Class_1'] = State_Class_1
project_data_test.drop(['Grade_1'], axis=1, inplace=True)

Hot Encoding Teacher Prefix - Train Data

In [133]:
prefix = list(project_data_train['teacher_prefix'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

prefix_list = []
for i in prefix:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    prefix_list.append(temp.strip())
In [134]:
project_data_train['clean_prefix'] = prefix_list
project_data_train.drop(['teacher_prefix'], axis=1, inplace=True)
project_data_train.head(2)
Out[134]:
index id teacher_id project_submitted_datetime project_title project_essay_1 project_essay_2 project_essay_3 project_essay_4 project_resource_summary ... Category_1 level_0 SubCategory_0 SubCategory_1 essay State_Class_0 State_Class_1 Grade_0 Grade_1 clean_prefix
0 0 p097740 0d153ad81f03058ce80c0c3c697b77b5 2017-03-31 00:34:13 A Mindstorm is Brewing When I tell people what I teach, the response ... LEGO Mindstorms will give my students an exper... NaN NaN My students need access to meaningful applicat... ... 0.816247 18 0.185149 0.809524 When I tell people what I teach, the response ... 0.14502849905003168 0.8549715009499683 0.163803 0.836197 Teacher
1 1 p153488 363788b51d40d978fe276bcb1f8a2b35 2017-03-30 22:46:14 Color Our Academic World \"All kids need is a little help, a little hop... Collaboration is our middle name! My kids work... NaN NaN My students need markers to be able to work in... ... 0.867174 307 0.059382 0.950000 \"All kids need is a little help, a little hop... 0.14502849905003168 0.8549715009499683 0.146718 0.853282 Mrs.

2 rows × 23 columns

In [135]:
unique_list = []
for x in prefix_list:
    if x not in unique_list: 
            unique_list.append(x)

            
categories=pd.DataFrame({'clean_prefix': unique_list})
categories=categories.sort_values(['clean_prefix'], ascending=True).reset_index()
print(categories.head(2))
   index clean_prefix
0      4          Dr.
1      3          Mr.
In [136]:
df1=project_data_train[['clean_prefix','labels']][(project_data_train['labels']==1)]
df2=project_data_train[['clean_prefix','labels']][(project_data_train['labels']==0)]
In [137]:
z =df1.groupby(['clean_prefix'])['labels'].value_counts() /project_data_train.groupby(['clean_prefix'])['labels'].count()
group_1=pd.DataFrame(z)
group_1=group_1.reset_index(drop=True)

z1 =df2.groupby(['clean_prefix'])['labels'].value_counts() /project_data_train.groupby(['clean_prefix'])['labels'].count()
group_0=pd.DataFrame(z1)
group_0=group_0.reset_index(drop=True)
In [138]:
x1= df1.groupby(['clean_prefix'])['labels'].value_counts() 
class_1=pd.DataFrame(x1)
class_1=class_1.reset_index(drop=True)

x0= df2.groupby(['clean_prefix'])['labels'].value_counts() 
class_0=pd.DataFrame(x0)
class_0=class_0.reset_index(drop=True)
In [139]:
Response_Table = pd.concat([categories, class_0, class_1],axis=1)
Response_Table = df_column_uniquify(Response_Table)
Response_Table.rename(columns={'labels':'Class=0','labels_1':'Class=1'},inplace=True)
Response_Table
Out[139]:
index clean_prefix Class=0 Class=1
0 4 Dr. 1 1
1 3 Mr. 528 2690
2 1 Mrs. 2640 14929
3 2 Ms. 1853 10147
4 0 Teacher 146 565
In [140]:
category_1 = pd.concat([categories,group_0,group_1],axis=1).reset_index()
category_1.drop(['level_0','index'],axis=1,inplace=True)
In [141]:
category_1 = df_column_uniquify(category_1)
category_1.rename(columns={'labels':'Prefix_0','labels_1':'Prefix_1'},inplace=True)
In [142]:
category_1["Prefix_0"].fillna( method ='ffill', inplace = True)
category_1["Prefix_1"].fillna( method ='ffill', inplace = True)
In [143]:
project_data_train = pd.merge(project_data_train, category_1, on='clean_prefix', how='left')
project_data_train.drop(['clean_prefix'],axis=1, inplace=True)
#project_data_train.head(1)
In [144]:
Cat_0 = list(project_data_train['Prefix_0'].values)
Category_Class_0 = []
for i in Cat_0:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("NaN",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    Category_Class_0.append(temp.strip())
    
project_data_train['Prefix_Class_0'] = Category_Class_0
project_data_train.drop(['Prefix_0'], axis=1, inplace=True)
In [145]:
Cat_1 = list(project_data_train['Prefix_1'].values)
Category_Class_1 = []
for i in Cat_1:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("NaN",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    Category_Class_1.append(temp.strip())
    
project_data_train['Prefix_Class_1'] = Category_Class_1
project_data_train.drop(['Prefix_1'], axis=1, inplace=True)

Hot Encoding Teacher Prefix - Test Data

In [146]:
prefix = list(project_data_test['teacher_prefix'].values)
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039

# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python

prefix_list_test = []
for i in prefix:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    prefix_list_test.append(temp.strip())
In [147]:
project_data_test['clean_prefix'] = prefix_list_test
project_data_test.drop(['teacher_prefix'], axis=1, inplace=True)
In [148]:
unique_list_test = []
for x in prefix_list_test:
    if x not in unique_list_test: 
            unique_list_test.append(x)
            
#https://stackoverflow.com/questions/41125909/python-find-elements-in-one-list-that-are-not-in-the-other       

difference=list(set(unique_list_test).difference(unique_list))
print(difference)
['']
In [150]:
df1=pd.DataFrame([[' '' ' ,0.5,0.5]],columns=['clean_prefix','Prefix_0','Prefix_1'])
In [151]:
category_1=category_1.append(df1, ignore_index = True)
In [152]:
project_data_test = pd.merge(project_data_test, category_1, on='clean_prefix', how='left')
project_data_test.drop(['clean_prefix'],axis=1, inplace=True)
In [153]:
Prefix_0 = list(project_data_test['Prefix_0'].values)

Prefix_0 = []
for i in Prefix_0:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    Prefix_0.append(temp.strip())
    
project_data_test['Prefix_Class_0'] = State_Class_0
project_data_test.drop(['Prefix_0'], axis=1, inplace=True)
In [154]:
Prefix_1 = list(project_data_test['Prefix_1'].values)

Prefix_1 = []
for i in Prefix_1:
    temp = ""
    
    for j in str(i).split(','): # it will split it in parts 
        if 'The' in j.split(): # this will split each of the state based on space 
            j=j.replace('The','') # if we have the words "The" we are going to replace it with ''(i.e removing 'The')
        j = j.replace(' ','') # we are placeing all the ' '(space) with ''(empty) 
        j = j.replace("nan",'0')
        temp +=j.strip()+" "#" abc ".strip() will return "abc", remove the trailing spaces
        temp = temp.replace('&','_')
    Prefix_1.append(temp.strip())
    
project_data_test['Prefix_Class_1'] = State_Class_1
project_data_test.drop(['Prefix_1'], axis=1, inplace=True)

Vectorizing Text data

Bag of words - Train Data

In [156]:
# We are considering only the words which appeared in at least 10 documents(rows or projects).
vectorizer6 = CountVectorizer(min_df=10, lowercase=False, binary=True)
text_bow = vectorizer6.fit_transform(preprocessed_essays)
print("Shape of matrix after one hot encodig ",text_bow.shape)

#print(text_bow)
Shape of matrix after one hot encodig  (33500, 10328)

Bag of Words Title - Train Data

In [157]:
vectorizer7=CountVectorizer(lowercase=False, binary=True, min_df=0)
title_bow = vectorizer7.fit_transform(preprocessed_title)
print("Shape of matrix after one hot encoding ",title_bow.shape)
Shape of matrix after one hot encoding  (33500, 9682)

Bag of Words Essay- Test Data

In [158]:
# We are considering only the words which appeared in at least 10 documents(rows or projects).
#vectorizer = CountVectorizer(min_df=10, ngram_range=(2,2), lowercase=False, binary=True, max_features=5000, )
text_bow_test = vectorizer6.transform(preprocessed_essays_test)
print("Shape of matrix after one hot encodig ",text_bow_test.shape)
Shape of matrix after one hot encodig  (16500, 10328)

Bag of Words Words Tittle - Test Data

In [159]:
# We are considering only the words which appeared in at least 10 documents(rows or titles).
#vectorizer = CountVectorizer(vocabulary=list(sorted_title_dict.keys()), lowercase=False, binary=True, min_df=0)
title_bow_test = vectorizer7.transform(preprocessed_title_test)
print("Shape of matrix after one hot encoding ",title_bow_test.shape)
Shape of matrix after one hot encoding  (16500, 9682)

TFIDF vectorizer Essays - Train Data

In [160]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer8 = TfidfVectorizer(min_df=10,lowercase=False, binary=True, max_features=5000)
text_tfidf = vectorizer8.fit_transform(preprocessed_essays)
print("Shape of matrix after one hot encodig ",text_tfidf.shape)
Shape of matrix after one hot encodig  (33500, 5000)

TFIDF Vectorizer Tittle - Train Data

In [161]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer9 = TfidfVectorizer(min_df=0, lowercase=False, binary=True, max_features=5000)
tittle_tfidf = vectorizer9.fit_transform(preprocessed_title)
print("Shape of matrix after one hot encoding ",tittle_tfidf.shape)
Shape of matrix after one hot encoding  (33500, 5000)

TFIDF Vectorizer Essay - Test Data

In [162]:
#from sklearn.feature_extraction.text import TfidfVectorizer
#vectorizer = TfidfVectorizer(vocabulary=sorted_essays_dict.keys(), lowercase=False, binary=True, min_df=10)
vectorizer8.fit(preprocessed_essays_test)
text_tfidf_test = vectorizer8.transform(preprocessed_essays_test)
print("Shape of matrix after one hot encodig ",text_tfidf_test.shape)
Shape of matrix after one hot encodig  (16500, 5000)

TFIDF Vectorizer Tittle - Test Data

In [163]:
#from sklearn.feature_extraction.text import TfidfVectorizer
#vectorizer = TfidfVectorizer(vocabulary=list(sorted_title_dict.keys()), lowercase=False, binary=True, min_df=0)
vectorizer9.fit(preprocessed_title_test)
title_tfidf_test = vectorizer9.transform(preprocessed_title_test)
print("Shape of matrix after one hot encodig ",title_tfidf_test.shape)
Shape of matrix after one hot encodig  (16500, 5000)

1.5.2.3 Using Pretrained Models: Avg W2V

In [164]:
# Reading glove vectors in python: https://stackoverflow.com/a/38230349/4084039
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
    f = open(gloveFile,'r', encoding="utf8")
    model = {}
    for line in tqdm(f):
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print ("Done.",len(model)," words loaded!")
    return model
model = loadGloveModel('glove.42B.300d.txt')

words = []
for i in preprocessed_essays:
    words.extend(i.split(' '))

for i in preprocessed_title:
    words.extend(i.split(' '))
print("all the words in the coupus", len(words))
words = set(words)
print("the unique words in the coupus", len(words))

inter_words = set(model.keys()).intersection(words)
print("The number of words that are present in both glove vectors and our coupus", \
      len(inter_words),"(",np.round(len(inter_words)/len(words)*100,3),"%)")

words_courpus = {}
words_glove = set(model.keys())
for i in words:
    if i in words_glove:
        words_courpus[i] = model[i]
print("word 2 vec length", len(words_courpus))


# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/

import pickle
with open('glove_vectors', 'wb') as f:
    pickle.dump(words_courpus, f)
Loading Glove Model
1917495it [04:05, 7806.48it/s]
Done. 1917495  words loaded!
all the words in the coupus 4772978
the unique words in the coupus 36469
The number of words that are present in both glove vectors and our coupus 33834 ( 92.775 %)
word 2 vec length 33834
In [165]:
# stronging variables into pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# make sure you have the glove_vectors file
with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())

AVG_W2V_Vectors Essays- Train Data -

In [166]:
# average Word2Vec
# compute average word2vec for each review.
avg_w2v_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(preprocessed_essays): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors.append(vector)

print(len(avg_w2v_vectors))
print(len(avg_w2v_vectors[0]))
100%|██████████████████████████████████████████████████████████████████████████| 33500/33500 [00:09<00:00, 3633.63it/s]
33500
300

AVG_W2V_Vectors Tittle - Train Data -

In [167]:
# average Word2Vec
# compute average word2vec for each review.
avg_w2v_vectors_tittle = []; # the avg-w2v for each title is stored in this list
for sentence in tqdm(preprocessed_title): # for each title
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the title
    for word in sentence.split(): # for each word in a title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_tittle.append(vector)

print(len(avg_w2v_vectors_tittle))
print(len(avg_w2v_vectors_tittle[0]))
100%|█████████████████████████████████████████████████████████████████████████| 33500/33500 [00:00<00:00, 46035.71it/s]
33500
300

AVG W2V on Essays- Test Data

In [168]:
# average Word2Vec
# compute average word2vec for each review.
avg_w2v_vectors_test = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(preprocessed_essays_test): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_test.append(vector)

print(len(avg_w2v_vectors_test))
print(len(avg_w2v_vectors_test[0]))
100%|██████████████████████████████████████████████████████████████████████████| 16500/16500 [00:08<00:00, 2001.70it/s]
16500
300

AVG W2V of Tittle Test Data

In [169]:
# average Word2Vec
# compute average word2vec for each review.
avg_w2v_vectors_tittle_test = []; # the avg-w2v for each title is stored in this list
for sentence in tqdm(preprocessed_title_test): # for each title
    vector = np.zeros(300) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the title
    for word in sentence.split(): # for each word in a title
        if word in glove_words:
            vector += model[word]
            cnt_words += 1
    if cnt_words != 0:
        vector /= cnt_words
    avg_w2v_vectors_tittle_test.append(vector)

print(len(avg_w2v_vectors_tittle_test))
print(len(avg_w2v_vectors_tittle_test[0]))
100%|█████████████████████████████████████████████████████████████████████████| 16500/16500 [00:00<00:00, 24404.94it/s]
16500
300

Using Pretrained Models: TFIDF weighted W2V for Train Data - Essays

In [170]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
tfidf_model_essays = TfidfVectorizer()
tfidf_model_essays.fit_transform(preprocessed_essays)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model_essays.get_feature_names(), list(tfidf_model_essays.idf_)))
tfidf_words = set(tfidf_model_essays.get_feature_names())
In [171]:
# average Word2Vec
# compute average word2vec for each review.
tfidf_w2v_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(preprocessed_essays): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors.append(vector)

print(len(tfidf_w2v_vectors))
print(len(tfidf_w2v_vectors[0]))
100%|███████████████████████████████████████████████████████████████████████████| 33500/33500 [01:07<00:00, 573.28it/s]
33500
300

Using Pretrained Models: TFIDF weighted W2V on `project_title` for Train Data

In [172]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
tfidf_model_title = TfidfVectorizer()
tfidf_model_title.fit_transform(preprocessed_title)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(tfidf_model_title.get_feature_names(), list(tfidf_model_title.idf_)))
tfidf_words = set(tfidf_model_title.get_feature_names())
In [173]:
# average Word2Vec
# compute average word2vec for each review.
tfidf_w2v_vectors_Title = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(preprocessed_title): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors_Title.append(vector)

print(len(tfidf_w2v_vectors_Title))
print(len(tfidf_w2v_vectors_Title[0]))
100%|█████████████████████████████████████████████████████████████████████████| 33500/33500 [00:00<00:00, 33962.45it/s]
33500
300

TFIDF weighted W2V Essays for Test Data

In [174]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
#vectorizer = TfidfVectorizer(vocabulary=sorted_essays_dict.keys(), lowercase=False, binary=True, min_df=10)
tfidf_model_essays.fit(preprocessed_essays_test)
tfidf_model_essays.transform(preprocessed_essays_test)
# we are converting a dictionary with word as a key, and the idf as a value
#dictionary = dict(zip(tfidf_model_essays.get_feature_names(), list(tfidf_model_essays.idf_)))
#tfidf_words = set(tfidf_model_essays.get_feature_names())
Out[174]:
<16500x27516 sparse matrix of type '<class 'numpy.float64'>'
	with 1777953 stored elements in Compressed Sparse Row format>
In [175]:
# average Word2Vec
# compute average word2vec for each review.
tfidf_w2v_vectors_test = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(preprocessed_essays_test): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors_test.append(vector)

print(len(tfidf_w2v_vectors_test))
print(len(tfidf_w2v_vectors_test[0]))
100%|███████████████████████████████████████████████████████████████████████████| 16500/16500 [00:29<00:00, 564.77it/s]
16500
300

TFIDF Weighted AVG_W2V Title for Test Data

In [176]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
#tfidf_model = TfidfVectorizer(vocabulary=sorted_title_dict.keys(), lowercase=False, binary=True, min_df=0)
tfidf_model_title.fit(preprocessed_title_test)
tfidf_model_title.transform(preprocessed_title_test)
# we are converting a dictionary with word as a key, and the idf as a value
#dictionary = dict(zip(tfidf_model_title.get_feature_names(), list(tfidf_model_title.idf_)))
#tfidf_words = set(tfidf_model_title.get_feature_names())
Out[176]:
<16500x6971 sparse matrix of type '<class 'numpy.float64'>'
	with 68576 stored elements in Compressed Sparse Row format>
In [177]:
# average Word2Vec
# compute average word2vec for each review.
tfidf_w2v_vectors_Title_test = []; # the avg-w2v for each sentence/review is stored in this list
for sentence in tqdm(preprocessed_title_test): # for each review/sentence
    vector = np.zeros(300) # as word vectors are of zero length
    tf_idf_weight =0; # num of words with a valid vector in the sentence/review
    for word in sentence.split(): # for each word in a review/sentence
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word] # getting the vector for each word
            # here we are multiplying idf value(dictionary[word]) and the tf value((sentence.count(word)/len(sentence.split())))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split())) # getting the tfidf value for each word
            vector += (vec * tf_idf) # calculating tfidf weighted w2v
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors_Title_test.append(vector)

print(len(tfidf_w2v_vectors_Title_test))
print(len(tfidf_w2v_vectors_Title_test[0]))
100%|█████████████████████████████████████████████████████████████████████████| 16500/16500 [00:00<00:00, 32492.06it/s]
16500
300

Vectorizing Numerical features for Train Data

In [178]:
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
project_data_train = pd.merge(project_data_train, price_data, on='id', how='left')
In [179]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

# price_standardized = standardScalar.fit(project_data['price'].values)
# this will rise the error
# ValueError: Expected 2D array, got 1D array instead: array=[725.05 213.03 329.   ... 399.   287.73   5.5 ].
# Reshape your data either using array.reshape(-1, 1)

price_scalar = StandardScaler()
price_scalar.fit_transform(project_data_train['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
price_standardized = price_scalar.transform(project_data_train['price'].values.reshape(-1, 1))
Mean : 299.8851808955224, Standard deviation : 380.6672289445736
In [180]:
price_standardized.shape
Out[180]:
(33500, 1)

Vectorizing Quantity - Train Data

In [181]:
import warnings
warnings.filterwarnings("ignore")


quantity_scalar = StandardScaler()
quantity_scalar.fit_transform(project_data_train['quantity'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
quantity_standardized = quantity_scalar.transform(project_data_train['quantity'].values.reshape(-1, 1))
Mean : 299.8851808955224, Standard deviation : 380.6672289445736
In [182]:
quantity_standardized
Out[182]:
array([[-0.48926166],
       [-0.56420641],
       [-0.60167878],
       ...,
       [-0.48926166],
       [-0.60167878],
       [-0.52673404]])

Verctorizing for Teacher Previously Posted Projected for Train Data -

In [183]:
import warnings
warnings.filterwarnings("ignore")

teacher_number_of_previously_posted_projects_scalar = StandardScaler()
teacher_number_of_previously_posted_projects_scalar.fit_transform(project_data_train['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {teacher_number_of_previously_posted_projects_scalar.mean_[0]}, Standard deviation : {np.sqrt(teacher_number_of_previously_posted_projects_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
teacher_number_of_previously_posted_projects_standardized = teacher_number_of_previously_posted_projects_scalar.transform(project_data_train['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
Mean : 11.197880597014926, Standard deviation : 27.755630462553043
In [184]:
teacher_number_of_previously_posted_projects_standardized
Out[184]:
array([[-0.36741664],
       [ 1.28990474],
       [ 0.17301424],
       ...,
       [-0.18727301],
       [ 0.02889934],
       [-0.18727301]])

Vectorizing Numerical features for Test Data

In [185]:
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
project_data_test = pd.merge(project_data_test, price_data, on='id', how='left')
In [186]:
# check this one: https://www.youtube.com/watch?v=0HOqOcln3Z4&t=530s
# standardization sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

# price_standardized = standardScalar.fit(project_data['price'].values)
# this will rise the error
# ValueError: Expected 2D array, got 1D array instead: array=[725.05 213.03 329.   ... 399.   287.73   5.5 ].
# Reshape your data either using array.reshape(-1, 1)

#price_scalar = StandardScaler()
price_scalar.fit(project_data_test['price'].values.reshape(-1,1))
price_scalar.transform(project_data_test['price'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {price_scalar.mean_[0]}, Standard deviation : {np.sqrt(price_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
price_standardized_test = price_scalar.transform(project_data_test['price'].values.reshape(-1, 1))
Mean : 298.21395454545456, Standard deviation : 373.16656677234437
In [187]:
price_standardized_test.shape
Out[187]:
(16500, 1)

Vectorizing Quantity - Test

In [188]:
#quantity_scalar = StandardScaler()
quantity_scalar.fit_transform(project_data_test['quantity'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {quantity_scalar.mean_[0]}, Standard deviation : {np.sqrt(quantity_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
quantity_test_standardized = quantity_scalar.transform(project_data_test['quantity'].values.reshape(-1, 1))
Mean : 17.056060606060605, Standard deviation : 27.052193261152617
In [189]:
quantity_test_standardized.shape
Out[189]:
(16500, 1)

Verctorizing for Teacher Previously Posted Projected for Test Data -

In [190]:
import warnings
warnings.filterwarnings("ignore")

teacher_number_of_previously_posted_projects_scalar.fit(project_data_test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1))
teacher_number_of_previously_posted_projects_scalar.transform(project_data_test['teacher_number_of_previously_posted_projects'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {teacher_number_of_previously_posted_projects_scalar.mean_[0]}, Standard deviation : {np.sqrt(teacher_number_of_previously_posted_projects_scalar.var_[0])}")

# Now standardize the data with above maen and variance.
teacher_number_of_previously_posted_projects_standardized_test = teacher_number_of_previously_posted_projects_scalar.transform(project_data_test['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1))
Mean : 11.353818181818182, Standard deviation : 28.954047615148312
In [191]:
teacher_number_of_previously_posted_projects_standardized_test.shape
Out[191]:
(16500, 1)
In [192]:
project_data_train.columns
Out[192]:
Index(['index', 'id', 'teacher_id', 'project_submitted_datetime',
       'project_title', 'project_essay_1', 'project_essay_2',
       'project_essay_3', 'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'labels', 'Category_0',
       'Category_1', 'level_0', 'SubCategory_0', 'SubCategory_1', 'essay',
       'State_Class_0', 'State_Class_1', 'Grade_0', 'Grade_1',
       'Prefix_Class_0', 'Prefix_Class_1', 'price', 'quantity'],
      dtype='object')
In [193]:
import warnings
warnings.filterwarnings("ignore")

Category_0 = StandardScaler()
Category_0.fit_transform(project_data_train['Category_0'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
Category_0_train_standardized = Category_0.transform(project_data_train['Category_0'].values.reshape(-1, 1))
In [194]:
Category_1 = StandardScaler()
Category_1.fit_transform(project_data_train['Category_1'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
Category_1_train_standardized = Category_1.transform(project_data_train['Category_0'].values.reshape(-1, 1))
In [195]:
SubCat_1 = StandardScaler()
SubCat_1.fit_transform(project_data_train['SubCategory_1'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
SubCat_1_train_standardized = SubCat_1.transform(project_data_train['SubCategory_1'].values.reshape(-1, 1))
In [196]:
SubCat_0 = StandardScaler()
SubCat_0.fit_transform(project_data_train['SubCategory_0'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
SubCat_0_train_standardized = SubCat_0.transform(project_data_train['SubCategory_0'].values.reshape(-1, 1))
In [197]:
State_0 = StandardScaler()
State_0.fit_transform(project_data_train['State_Class_0'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
State_0_train_standardized = State_0.transform(project_data_train['State_Class_0'].values.reshape(-1, 1))
In [198]:
State_1 = StandardScaler()
State_1.fit_transform(project_data_train['State_Class_1'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
State_1_standardized = State_1.transform(project_data_train['State_Class_1'].values.reshape(-1, 1))
In [199]:
Grade_1 = StandardScaler()
Grade_1.fit_transform(project_data_train['Grade_0'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
Grade_1_standardized = Grade_1.transform(project_data_train['Grade_0'].values.reshape(-1, 1))
In [200]:
Grade_0 = StandardScaler()
Grade_0.fit_transform(project_data_train['Grade_1'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
Grade_0_standardized = Grade_0.transform(project_data_train['Grade_1'].values.reshape(-1, 1))
In [201]:
Prefix_0 = StandardScaler()
Prefix_0.fit_transform(project_data_train['Prefix_Class_0'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
Prefix_0_standardized = Prefix_0.transform(project_data_train['Prefix_Class_0'].values.reshape(-1, 1))
In [202]:
Prefix_1 = StandardScaler()
Prefix_1.fit_transform(project_data_train['Prefix_Class_1'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
#print(f"Mean : {essays.mean_[0]}, Standard deviation : {np.sqrt(essays.var_[0])}")

# Now standardize the data with above maen and variance.
Prefix_1_standardized = Prefix_1.transform(project_data_train['Prefix_Class_0'].values.reshape(-1, 1))

Merging Features for BoW

In [231]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
# with the same hstack function we are concatinating a sparse matrix and a dense matirx :)
X_train = hstack((Category_1_train_standardized,Category_0_train_standardized,SubCat_1_train_standardized,SubCat_0_train_standardized,State_0_train_standardized,Grade_1_standardized,Grade_0_standardized,State_1_standardized,Prefix_0_standardized,Prefix_1_standardized,teacher_number_of_previously_posted_projects_standardized,price_standardized,title_bow,text_bow)).tocsr()  #https://www.kaggle.com/c/quora-question-pairs/discussion/33491 taken from
X_train.shape
Out[231]:
(33500, 20022)
In [232]:
project_data_test.columns
Out[232]:
Index(['index', 'id', 'teacher_id', 'project_submitted_datetime',
       'project_title', 'project_essay_1', 'project_essay_2',
       'project_essay_3', 'project_essay_4', 'project_resource_summary',
       'teacher_number_of_previously_posted_projects', 'Category_0',
       'Category_1', 'clean_subcategories', 'SubCategory_0', 'SubCategory_1',
       'essay', 'State_Class_0', 'State_Class_1', 'Grade_Class_0',
       'Grade_Class_1', 'Prefix_Class_0', 'Prefix_Class_1', 'price',
       'quantity'],
      dtype='object')
In [233]:
Category_1_test_standardized = Category_1.transform(project_data_test['Category_1'].values.reshape(-1, 1))
Category_0_test_standardized = Category_0.transform(project_data_test['Category_0'].values.reshape(-1, 1))
SubCat_0_test_standardized = SubCat_0.transform(project_data_test['SubCategory_0'].values.reshape(-1, 1))
SubCat_1_test_standardized = SubCat_1.transform(project_data_test['SubCategory_1'].values.reshape(-1, 1))
State_0_test_standardized = State_0.transform(project_data_test['State_Class_0'].values.reshape(-1, 1))
State_1_test_standardized = State_1.transform(project_data_test['State_Class_1'].values.reshape(-1, 1))
Grade_0_test_standardized = Grade_0.transform(project_data_test['Grade_Class_0'].values.reshape(-1, 1))
Grade_1_test_standardized = Grade_1.transform(project_data_test['Grade_Class_1'].values.reshape(-1, 1))
Prefix_0_test_standardized = Prefix_0.transform(project_data_test['Prefix_Class_0'].values.reshape(-1, 1))
Prefix_1_test_standardized = Prefix_1.transform(project_data_test['Prefix_Class_1'].values.reshape(-1, 1))
In [234]:
X_test = hstack((Category_1_test_standardized,Category_0_test_standardized,SubCat_0_test_standardized,SubCat_1_test_standardized,State_0_test_standardized,State_1_test_standardized,Grade_0_test_standardized,Grade_1_test_standardized,Prefix_0_test_standardized,Prefix_1_test_standardized,teacher_number_of_previously_posted_projects_standardized_test,price_standardized_test,text_bow_test,title_bow_test)).tocsr()  #https://www.kaggle.com/c/quora-question-pairs/discussion/33491 taken from
X_test.shape
Out[234]:
(16500, 20022)

RF for BoW

In [209]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import cross_val_score
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn import cross_validation
from math import log
In [210]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

C = RandomForestClassifier()

n_estimators=[10,50,100,500,1000]
max_depth=[1, 5, 10, 50, 100, 500, 100]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
6              100                  2.00000
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              500                  2.69897
4             1000                  3.00000
In [211]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

# https://plot.ly/python/3d-axes/
trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [212]:
def model_predict(clf, data):
    # roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
    # not the predicted outputs

    y_data_pred = []
    y_data_pred.extend(clf.predict_proba(data[:])[:,1])
  
    return y_data_pred
In [262]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = RandomForestClassifier(n_estimators=100,max_depth=5,class_weight='balanced')
neigh.fit(X_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X_train)   
y_test_pred = model_predict(neigh, X_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [248]:
# we are writing our own function for predict, with defined thresould
# we will pick a threshold that will give the least fpr
def predict(proba, threshould, fpr, tpr):
    
    t = threshould[np.argmax(tpr*(1-fpr))]
    
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    predictions = []
    for i in proba:
        if i>=t:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
In [249]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.484
Out[249]:
Text(0.5,1,'Confusion Matrix for Train')
In [254]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.2499605067234218 for threshold 0.857
Out[254]:
Text(0.5,1,'Confusion Matrix for Test')

GBDT for BoW

In [217]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

C = GradientBoostingClassifier()

n_estimators=[10,50,100,500,1000]
max_depth=[1, 5, 10, 50, 100, 500, 100]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
6              100                  2.00000
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              500                  2.69897
4             1000                  3.00000
In [218]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

# https://plot.ly/python/3d-axes/
trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [260]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
from sklearn.ensemble import GradientBoostingClassifier


neigh = GradientBoostingClassifier(n_estimators=50,max_depth=1)
neigh.fit(X_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X_train)   
y_test_pred = model_predict(neigh, X_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show() 
In [258]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.781
Out[258]:
Text(0.5,1,'Confusion Matrix for Train')
In [259]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.24999614323470915 for threshold 0.863
Out[259]:
Text(0.5,1,'Confusion Matrix for Test')

Merging Features for TFIDF

In [277]:
# merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
from scipy.sparse import hstack
# with the same hstack function we are concatinating a sparse matrix and a dense matirx :)
X1_train = hstack((Category_1_train_standardized,Category_0_train_standardized,SubCat_1_train_standardized,SubCat_0_train_standardized,State_0_train_standardized,Grade_1_standardized,Grade_0_standardized,State_1_standardized,Prefix_0_standardized,Prefix_1_standardized,teacher_number_of_previously_posted_projects_standardized,price_standardized,text_tfidf,tittle_tfidf)).tocsr()  #https://www.kaggle.com/c/quora-question-pairs/discussion/33491 taken from
X1_train.shape
Out[277]:
(33500, 10012)
In [278]:
X1_test = hstack((Category_1_test_standardized,Category_0_test_standardized,SubCat_0_test_standardized,SubCat_1_test_standardized,State_0_test_standardized,State_1_test_standardized,Grade_0_test_standardized,Grade_1_test_standardized,Prefix_0_test_standardized,Prefix_1_test_standardized,teacher_number_of_previously_posted_projects_standardized_test,price_standardized_test,text_tfidf_test,title_tfidf_test)).tocsr()  #https://www.kaggle.com/c/quora-question-pairs/discussion/33491 taken from
X1_test.shape
Out[278]:
(16500, 10012)

RF for TFIDF

In [268]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

C = RandomForestClassifier()

n_estimators=[10,50,100,500,1000]
max_depth=[1, 5, 10, 50, 100, 500, 100]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X1_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']

import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

# https://plot.ly/python/3d-axes/
trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
6              100                  2.00000
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              500                  2.69897
4             1000                  3.00000
In [281]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = RandomForestClassifier(n_estimators=100,max_depth=7,class_weight='balanced')
neigh.fit(X1_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X1_train)   
y_test_pred = model_predict(neigh, X1_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [270]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.491
Out[270]:
Text(0.5,1,'Confusion Matrix for Train')
In [271]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.2499697629601198 for threshold 0.521
Out[271]:
Text(0.5,1,'Confusion Matrix for Test')

Gradient Boosting for TFIDF

In [272]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

C = GradientBoostingClassifier()

n_estimators=[10,50,100,500,1000]
max_depth=[1, 5, 10, 50, 100, 500, 100]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X1_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
6              100                  2.00000
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              500                  2.69897
4             1000                  3.00000
In [273]:
# https://plot.ly/python/3d-axes/
trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation') 
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [282]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = GradientBoostingClassifier(n_estimators=50,max_depth=2)
neigh.fit(X1_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X1_train)   
y_test_pred = model_predict(neigh, X1_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [283]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.824
Out[283]:
Text(0.5,1,'Confusion Matrix for Train')
In [284]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.24998457293883664 for threshold 0.888
Out[284]:
Text(0.5,1,'Confusion Matrix for Test')

Merging Features for AVG W2V

In [307]:
from scipy.sparse import coo_matrix, hstack
X2_train = np.hstack(((Category_1_train_standardized),(Category_0_train_standardized),(Grade_1_standardized),(Grade_0_standardized),(State_0_train_standardized),(Prefix_0_standardized),(Prefix_1_standardized),(teacher_number_of_previously_posted_projects_standardized),(price_standardized),(SubCat_1_train_standardized),(SubCat_0_train_standardized),(State_1_standardized),avg_w2v_vectors,avg_w2v_vectors_tittle))
X2_train.shape
Out[307]:
(33500, 612)
In [308]:
X2_test = np.hstack(((Category_1_test_standardized),(Category_0_test_standardized),(SubCat_0_test_standardized),(SubCat_1_test_standardized),(State_0_test_standardized),(State_1_test_standardized),(Grade_0_test_standardized),(Grade_1_test_standardized),(Prefix_0_test_standardized),(Prefix_1_test_standardized),(teacher_number_of_previously_posted_projects_standardized_test),(price_standardized_test),avg_w2v_vectors_test,avg_w2v_vectors_tittle_test))
X2_test.shape
Out[308]:
(16500, 612)

RF for AVG W2V

In [287]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

C = RandomForestClassifier()

n_estimators=[10,50,100,200,500]
max_depth=[1, 5, 10, 50, 100, 500, 100]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X2_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
6              100                  2.00000
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              200                  2.30103
4              500                  2.69897
In [288]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [289]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = RandomForestClassifier(n_estimators=50,max_depth=5,class_weight='balanced')
neigh.fit(X2_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X2_train)   
y_test_pred = model_predict(neigh, X2_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [290]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.478
Out[290]:
Text(0.5,1,'Confusion Matrix for Train')
In [291]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.521
Out[291]:
Text(0.5,1,'Confusion Matrix for Test')

Gradient Boosting Classifier

In [303]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
#from sklearn.model_sel


#ection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

C = GradientBoostingClassifier()

n_estimators=[10,50,100,200,500]
max_depth=[1, 5, 10, 50, 100, 500]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X2_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              200                  2.30103
4              500                  2.69897
In [304]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [310]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = GradientBoostingClassifier(n_estimators=50,max_depth=2)
neigh.fit(X2_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X2_train)   
y_test_pred = model_predict(neigh, X2_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [311]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.2499999625583491 for threshold 0.816
Out[311]:
Text(0.5,1,'Confusion Matrix for Train')
In [312]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.861
Out[312]:
Text(0.5,1,'Confusion Matrix for Test')

RF for TFIDF W2V

In [317]:
from scipy.sparse import coo_matrix, hstack
X3_train = np.hstack(((Category_1_train_standardized),(Category_0_train_standardized),(Grade_1_standardized),(Grade_0_standardized),(State_0_train_standardized),(Prefix_0_standardized),(Prefix_1_standardized),(teacher_number_of_previously_posted_projects_standardized),(price_standardized),(SubCat_1_train_standardized),(SubCat_0_train_standardized),(State_1_standardized),tfidf_w2v_vectors,tfidf_w2v_vectors_Title))
X3_train.shape
Out[317]:
(33500, 612)
In [318]:
X3_test = np.hstack(((Category_1_test_standardized),(Category_0_test_standardized),(SubCat_0_test_standardized),(SubCat_1_test_standardized),(State_0_test_standardized),(State_1_test_standardized),(Grade_0_test_standardized),(Grade_1_test_standardized),(Prefix_0_test_standardized),(Prefix_1_test_standardized),(teacher_number_of_previously_posted_projects_standardized_test),(price_standardized_test),tfidf_w2v_vectors_test,tfidf_w2v_vectors_Title_test))
X3_test.shape
Out[318]:
(16500, 612)
In [294]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV

C = RandomForestClassifier()

n_estimators=[10,50,100,500,1000]
max_depth=[1, 5, 10, 50, 100, 500, 100]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X3_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
6              100                  2.00000
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              500                  2.69897
4             1000                  3.00000
In [295]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [296]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = RandomForestClassifier(n_estimators=50,max_depth=5,class_weight='balanced')
neigh.fit(X3_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X3_train)   
y_test_pred = model_predict(neigh, X3_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [297]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.444
Out[297]:
Text(0.5,1,'Confusion Matrix for Train')
In [298]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.24999984572938835 for threshold 0.483
Out[298]:
Text(0.5,1,'Confusion Matrix for Test')

Gradient Boosting

In [314]:
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
#from sklearn.model_sel


#ection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

C = GradientBoostingClassifier()

n_estimators=[10,50,100,200,500]
max_depth=[1, 5, 10, 50, 100, 500]

import math

log_max_depth = [math.log10(x) for x in max_depth]
log_n_estimators=[math.log10(x) for x in n_estimators]

print("Printing parameter Data and Corresponding Log value for Max Depth")
data={'Parameter value':max_depth,'Corresponding Log Value':log_max_depth}
param=pd.DataFrame(data)
print("="*100)
print(param)

print("Printing parameter Data and Corresponding Log value for Estimators")
data={'Parameter value':n_estimators,'Corresponding Log Value':log_n_estimators}
param=pd.DataFrame(data)
print("="*100)
print(param)

parameters = {'n_estimators':n_estimators, 'max_depth':max_depth}
clf = GridSearchCV(C, parameters, cv=3, scoring='roc_auc',n_jobs=-1)
clf.fit(X3_train, labels_train)

#data={'Parameter value':[0.0001,0.001,0.01,0.1,1,5,10,20,30,40],'Corresponding Log Value':[log_my_data]}

train_auc= clf.cv_results_['mean_train_score']
train_auc_std= clf.cv_results_['std_train_score']
cv_auc = clf.cv_results_['mean_test_score'] 
cv_auc_std= clf.cv_results_['std_test_score']
Printing parameter Data and Corresponding Log value for Max Depth
====================================================================================================
   Parameter value  Corresponding Log Value
0                1                  0.00000
1                5                  0.69897
2               10                  1.00000
3               50                  1.69897
4              100                  2.00000
5              500                  2.69897
Printing parameter Data and Corresponding Log value for Estimators
====================================================================================================
   Parameter value  Corresponding Log Value
0               10                  1.00000
1               50                  1.69897
2              100                  2.00000
3              200                  2.30103
4              500                  2.69897
In [315]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
import numpy as np

trace1 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=train_auc, name = 'train')
trace2 = go.Scatter3d(x=log_n_estimators,y=log_max_depth,z=cv_auc, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='n_estimators'),
        yaxis = dict(title='max_depth'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')
In [323]:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
from sklearn.metrics import roc_curve, auc
#from sklearn.calibration import CalibratedClassifierCV


neigh = GradientBoostingClassifier(n_estimators=50,max_depth=2)
neigh.fit(X3_train, labels_train)
# roc_auc_score(y_true, y_score) the 2nd parameter should be probability estimates of the positive class
# not the predicted outputs

y_train_pred = model_predict(neigh, X3_train)   
y_test_pred = model_predict(neigh, X3_test)


train_fpr, train_tpr, tr_thresholds = roc_curve(labels_train, y_train_pred)
test_fpr, test_tpr, te_thresholds = roc_curve(labels_test, y_test_pred)

plt.plot(train_fpr, train_tpr, label="train AUC ="+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC ="+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("K: hyperparameter")
plt.ylabel("AUC")
plt.title("ERROR PLOTS")
plt.grid()
plt.show()
In [320]:
print("="*100)
from sklearn.metrics import confusion_matrix
print("Train confusion matrix")
cm=confusion_matrix(labels_train, predict(y_train_pred, tr_thresholds, train_fpr, train_fpr))
sns.heatmap(cm, annot=True, fmt="d" )
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Train")
====================================================================================================
Train confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.47
Out[320]:
Text(0.5,1,'Confusion Matrix for Train')
In [321]:
print("Test confusion matrix")
cm1=confusion_matrix(labels_test, predict(y_test_pred, tr_thresholds, test_fpr, test_fpr))
sns.heatmap(cm1, annot=True,fmt="d")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title("Confusion Matrix for Test")
Test confusion matrix
the maximum value of tpr*(1-fpr) 0.25 for threshold 0.52
Out[321]:
Text(0.5,1,'Confusion Matrix for Test')
In [330]:
from prettytable import PrettyTable


print("Pretty Table for Random Forest")
print("--"*50)

x = PrettyTable()

x.field_names = ["Vectorizer", "Model","n_estimators", "Max_depth", "Train AUC", "Test AUC"]

x.add_row(["BOW", "Brute(Decision Tree)","100", 5 , 75, 59])
x.add_row(["TFIDF", "Brute(Decision Tree)","100", 7, 74, 60])
x.add_row(["AVG W2V", "Brute(Decision Tree)","50", 5, 75, 64 ])
x.add_row(["TFIDF W2V", "Brute(Decision Tree)","50", 5, 68, 63 ])
print(x)
Pretty Table for Random Forest
----------------------------------------------------------------------------------------------------
+------------+----------------------+--------------+-----------+-----------+----------+
| Vectorizer |        Model         | n_estimators | Max_depth | Train AUC | Test AUC |
+------------+----------------------+--------------+-----------+-----------+----------+
|    BOW     | Brute(Decision Tree) |     100      |     5     |     75    |    59    |
|   TFIDF    | Brute(Decision Tree) |     100      |     7     |     74    |    60    |
|  AVG W2V   | Brute(Decision Tree) |      50      |     5     |     75    |    64    |
| TFIDF W2V  | Brute(Decision Tree) |      50      |     5     |     68    |    63    |
+------------+----------------------+--------------+-----------+-----------+----------+
In [331]:
from prettytable import PrettyTable


print("Pretty Table for GBDT")
print("--"*50)

x = PrettyTable()

x.field_names = ["Vectorizer", "Model","n_estimators", "Max_depth", "Train AUC", "Test AUC"]

x.add_row(["BOW", "Brute(Decision Tree)","50", 1 , 66, 61])
x.add_row(["TFIDF", "Brute(Decision Tree)","50", 2, 70, 62])
x.add_row(["AVG W2V", "Brute(Decision Tree)","50", 2, 71, 61 ])
x.add_row(["TFIDF W2V", "Brute(Decision Tree)","50", 2, 72, 63 ])
print(x)
Pretty Table for GBDT
----------------------------------------------------------------------------------------------------
+------------+----------------------+--------------+-----------+-----------+----------+
| Vectorizer |        Model         | n_estimators | Max_depth | Train AUC | Test AUC |
+------------+----------------------+--------------+-----------+-----------+----------+
|    BOW     | Brute(Decision Tree) |      50      |     1     |     66    |    61    |
|   TFIDF    | Brute(Decision Tree) |      50      |     2     |     70    |    62    |
|  AVG W2V   | Brute(Decision Tree) |      50      |     2     |     71    |    61    |
| TFIDF W2V  | Brute(Decision Tree) |      50      |     2     |     72    |    63    |
+------------+----------------------+--------------+-----------+-----------+----------+